Can we build language-independent OCR
using LSTM networks?
Adnan Ul-Hasan
Technical University of Kaiserslautern
67663 Kaiserslautern, Germany
adnan@cs.uni-kl.de
Thomas M. Breuel
Technical University of Kaiserslautern
67663 Kaiserslautern, Germany
tmb@cs.uni-kl.de
ABSTRACT
Language models or recognition dictionaries are usually considered an essential step in OCR. However, using a language model complicates the training of OCR systems, and it also narrows the range of texts that an OCR system can be used with. Recent results have shown that Long Short-Term Memory (LSTM) based OCR yields low error rates even without language modeling. In this paper, we explore the question to what extent LSTM models can be used for multilingual OCR without the use of language models. To do this, we measure the cross-language performance of LSTM models trained on different languages. LSTM models show good promise for language-independent OCR: the recognition errors are very low (around 1%) without using any language model or dictionary correction.
Keywords
MOCR, LSTM Networks, RNN
1. INTRODUCTION
Multilingual OCR (MOCR) is of interest for many reasons: digitizing historical books containing two or more scripts, bilingual books, dictionaries, and books with line-by-line translation are a few of the reasons to want reliable multilingual OCR systems. However, MOCR also presents several unique challenges, as Popat [1] pointed out in the context of the Google Books project1. Some of these unique challenges are:

- Multiple scripts/languages on a page (multi-script identification).
- Multiple languages in the same or similar fonts, like Arabic-Persian or English-German.
- The same language in multiple scripts, like Urdu in Nastaleeq and Naskh scripts.
- Archaic and reformed orthographies, e.g. 18th-century English, Fraktur (historical German), etc.

1http://en.wikipedia.org/wiki/Google_Books
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Request permissions from Permissions@acm.org.
MOCR ’13, August 24 2013, Washington, DC, USA
Copyright 2013 ACM 978-1-4503-2114-3/13/08 ...$15.00.
http://dx.doi.org/10.1145/2505377.2505394
There have been reported efforts to adapt existing OCR systems to other languages. The open-source OCR system Tesseract [2] is one such example. Its basic classification is based on hierarchical shape-classification, where the character set is first reduced to a few characters and then, at the last stage, the test sample is matched against the representatives of this short list. Although Tesseract can be used for a variety of languages (due to the support available for many languages), it cannot be used as an all-in-one solution for situations where multiple scripts occur together.
The usual approach to the multilingual OCR problem is to somehow combine two or more separate classifiers [3], as it is believed that reasonable OCR output for a single script cannot be obtained without sophisticated post-processing steps such as language modelling, the use of a dictionary to correct OCR errors, font adaptation, etc. Natarajan et al. [4] proposed an HMM-based script-independent multilingual OCR system. Its feature extraction, training and recognition components are all language independent; however, it uses language-specific word lexica and language models for recognition. To the best of our knowledge, no OCR method had been proposed that can achieve very low error rates without the aforementioned sophisticated post-processing techniques. But recent experiments on English and German script using LSTM networks [5] have shown that reliable OCR results can be obtained without such techniques.
Our hypothesis for multilingual OCR is that if a single model can be obtained, at least for a family of scripts (e.g. Latin, Arabic, Indic), we can then use this single model to recognize the scripts of that particular family, thereby reducing the effort of combining multiple classifiers. Since LSTM networks can achieve very low error rates without a language-modelling post-processing step, they are a natural candidate for multilingual OCR.
In this paper, we report the results of applying LSTM networks to the multilingual OCR problem. The basic aim is to benchmark to what extent LSTM networks rely on language modelling to predict the correct labels, and whether we can do without language modelling and other post-processing steps. Additionally, we want to see how well LSTM networks use context to recognize a particular character. Specifically, we trained LSTM networks for English, German, French and a mixed set of these three languages, and tested each model on all of these languages. The LSTM-based models achieve very high recognition accuracy without the aid of language modelling, and they show good promise for multilingual OCR tasks.
Figure 1: Some sample images from our database. There are 96 variations of standard fonts in common use; e.g. for the Times TrueType font, the normal, italic, bold and bold-italic variations were included. Also note that these images were degraded to reflect scanning artefacts.
In what follows, the preprocessing step is described in the next section, Section 3 describes the configuration of the LSTM networks used in the experiments, and Section 4 gives the details of the experimental evaluation. Section 5 concludes the paper with a discussion of the current work and future directions.
2. PREPROCESSING
The scale and relative position of a character are important features for distinguishing characters in Latin script (and some other scripts). Text-line normalization is therefore an essential step in applying 1D LSTM networks to OCR. In this work, we used the normalization approach introduced in [5], namely text-line normalization based on a trainable, shape-based model. A token dictionary, created from a collection of text lines, contains information about the x-height, the baseline (geometric features) and the shape of individual characters. These models are then used to normalize any text-line image.
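To make the input/output contract of this step concrete, below is a minimal sketch of plain height normalization in Python. Note that the actual method of [5] is trainable and shape-based; this rescaling stand-in only illustrates the interface (grayscale line image in, fixed-height image out), not the model itself.

# Minimal stand-in for text-line normalization (NOT the trainable,
# shape-based model of [5]): rescale a grayscale line image to the
# fixed height of 32 pixels used in Section 4.2, preserving aspect ratio.
from PIL import Image

TARGET_HEIGHT = 32

def normalize_line(line: Image.Image, target_height: int = TARGET_HEIGHT) -> Image.Image:
    w, h = line.size
    new_w = max(1, round(w * target_height / h))
    return line.convert("L").resize((new_w, target_height), Image.BILINEAR)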
3. LSTM NETWORKS
Recurrent Neural Networks (RNNs) have shown great promise in recent times due to the Long Short-Term Memory (LSTM) architecture [6], [7]. The LSTM architecture differs significantly from earlier architectures like Elman networks [8] and echo-state networks [9], and appears to overcome many of the limitations and problems of those earlier architectures.
Traditional RNNs, though good at context-aware processing [10], have not shown competitive performance on OCR and speech recognition tasks. Their poor performance is attributed mainly to the vanishing gradient problem [11, 12]. The Long Short-Term Memory [6] architecture was designed to overcome this problem. It is a highly non-linear recurrent network with multiplicative "gates" and additive feedback. Graves et al. [7] introduced the bidirectional LSTM architecture for accessing context in both the forward and backward directions; both layers are then connected to a single output layer. To avoid the requirement of segmented training data, Graves et al. [13] used a forward-backward algorithm to align transcripts with the output of the neural network. The interested reader is referred to the above-mentioned references for further details regarding LSTM and RNN architectures.
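For reference, a standard formulation of the LSTM cell is the following (written in our own notation, with the now-common forget gate; a textbook form, not a verbatim reproduction of the equations of [6]):

\begin{aligned}
i_t &= \sigma(W_{xi}x_t + W_{hi}h_{t-1} + b_i)\\
f_t &= \sigma(W_{xf}x_t + W_{hf}h_{t-1} + b_f)\\
o_t &= \sigma(W_{xo}x_t + W_{ho}h_{t-1} + b_o)\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}

Here \sigma is the logistic sigmoid and \odot is element-wise multiplication; i_t, f_t and o_t are the multiplicative input, forget and output gates, and the c_t update is the additive feedback mentioned above.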
For recognition, we used a 1D bidirectional LSTM architecture, as described in [7]. We found that the 1D architecture outperforms its 2D or higher-dimensional siblings for printed OCR tasks. For all the experiments reported in this paper, we used a modified version of the LSTM library described in [14]. That library provides 1D and multidimensional LSTM networks, together with ground-truth alignment using a forward-backward algorithm ("CTC", connectionist temporal classification [13]). The library also provides a heuristic decoding mechanism to map the frame-wise network output onto a sequence of symbols. We have reimplemented LSTM networks and forward-backward alignment from scratch and reproduced these results (our implementation uses a slightly different decoding mechanism). This implementation has been released in open-source form [15] (OCRopus version 0.7).
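As an illustration of such a decoding heuristic, the following sketch implements the common best-path ("greedy") CTC decoding: take the most probable class per frame, collapse runs of repeated labels, and drop the blank. This variant is an assumption of ours; the library's actual heuristic may differ in detail.

# Best-path CTC decoding sketch (assumed variant, see above).
# posteriors: (T, C) array of per-frame class probabilities; class 0 is
# the CTC blank; alphabet[k] is the character for class k (entry 0 unused).
import numpy as np

def ctc_best_path(posteriors: np.ndarray, alphabet: str, blank: int = 0) -> str:
    best = posteriors.argmax(axis=1)          # frame-wise winning class
    out, prev = [], blank
    for label in best:
        if label != blank and label != prev:  # collapse repeats, skip blanks
            out.append(alphabet[label])
        prev = label
    return "".join(out)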
During the training stage, randomly chosen input text-line images are presented as 1D sequences to the forward-propagation step through the LSTM cells, and the forward-backward alignment of the output is then performed. Errors are back-propagated to update the weights, and the process is repeated for the next randomly selected text-line image. Note that raw pixel values are used as the only features; no other sophisticated features were extracted from the text-line images. The implicit features in the 1D sequence are the baseline and the x-height of individual characters.
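Schematically, this training procedure reads as follows. The method names here are hypothetical placeholders for illustration, not the API of the library in [14] or of our implementation [15].

# Schematic training loop (hypothetical API, for illustration only).
import random

def train(network, lines, steps=1_000_000):
    for step in range(steps):
        line = random.choice(lines)                    # random text-line image
        outputs = network.forward(line.image)          # raw pixel columns in, per-frame posteriors out
        targets = ctc_align(outputs, line.transcript)  # forward-backward (CTC) alignment
        network.backward(outputs, targets)             # back-propagate the alignment error
        network.update()                               # weight update (see Section 4.2)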
4. EXPERIMENTAL EVALUATION
The aim of our experiments was to evaluate LSTM performance on multilingual OCR without the aid of language modelling or other language-specific assistance. To explore the cross-language performance of LSTM networks, a number of experiments were performed. We trained four separate LSTM networks: for English, German, French and a mixed set of all these languages. For testing, we have a total of 16 permutations (four models, each tested on four scripts).
Table 1: Statistics on the number of text-line images in each of the English, French, German and mixed-script datasets.

Language     | Total   | Training | Test
-------------+---------+----------+------
English      | 85,350  | 81,600   | 4,750
French       | 85,350  | 81,600   | 4,750
German       | 114,749 | 110,400  | 4,349
Mixed-script | 85,350  | 81,600   | 4,750
Table 2: Experimental results of applying LSTM networks to multilingual OCR. These results validate our hypothesis that a single LSTM model trained with a mixture of scripts (from a single script family) can be used to recognize the text of the individual family members. Note that the error rates for testing the LSTM network trained for German on French, and the networks trained for English on French and German, were obtained by ignoring words containing special characters (umlauts and accented letters), to correctly gauge the effect of the language model of a particular language. LSTM networks trained for individual languages can also be used to recognize other scripts, but they show some language dependence. All these results were achieved without the aid of any language model.

Script \ Model | English (%) | German (%) | French (%) | Mixed (%)
---------------+-------------+------------+------------+----------
English        | 0.5         | 1.22       | 4.1        | 1.06
German         | 2.04        | 0.85       | 4.7        | 1.2
French         | 1.8         | 1.4        | 1.1        | 1.05
Mixed-script   | 1.7         | 1.1        | 2.9        | 1.1
Each LSTM network was tested on the respective script and on the other three scripts; e.g. the LSTM network trained on German was tested on German, French, English and mixed-script data. These results are summarized in Table 2, and some sample outputs are presented in Table 3. As the error measure, we used the ratio of insertions, deletions and substitutions relative to the ground truth; accuracy was measured at the character level.
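In formula form, with I insertions, D deletions and S substitutions against a ground truth of N characters, the reported character error rate is

\mathrm{CER} = \frac{I + D + S}{N} \times 100\,\%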
4.1 Database
A separate synthetic database for each language was developed using OCRopus [16] (ocropus-linegen). This utility requires a set of UTF-8 encoded text files and a set of TrueType fonts; with these, one can artificially generate any number of text-line images. The utility also provides controls for inducing scanning artefacts such as distortion, jitter and other degradations. Separate corpora of text-line images in German, English and French were generated in commonly used fonts (including bold, italic and italic-bold variants) from freely available literature. These images were degraded using the degradation models of [17] to reflect scanning artefacts. There are four degradation parameters, namely elastic elongation, jitter, sensitivity and threshold. Sample text-line images from our database are shown in Figure 1. Each database is further divided into training and test datasets. Statistics on the number of text-line images in each of the four scripts are given in Table 1.
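The sketch below illustrates this generation pipeline in simplified form: render a line of text with a TrueType font, then degrade it. It is not ocropus-linegen or the defect model of [17]; in particular, "jitter" is loosely interpreted here as per-pixel noise, and only two of the four degradation parameters are shown.

# Simplified synthetic line generation (illustrative only; the real data
# was produced with ocropus-linegen [16] and the degradation models of [17]).
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_line(text: str, font_path: str, size: int = 48) -> np.ndarray:
    font = ImageFont.truetype(font_path, size=size)
    width = int(font.getlength(text)) + 20
    img = Image.new("L", (width, size + 20), color=255)
    ImageDraw.Draw(img).text((10, 5), text, fill=0, font=font)
    return np.asarray(img, dtype=np.float32) / 255.0

def degrade(line: np.ndarray, jitter: float = 0.05, threshold: float = 0.5) -> np.ndarray:
    noisy = line + np.random.normal(0.0, jitter, size=line.shape)  # "jitter" as pixel noise
    return (noisy > threshold).astype(np.uint8) * 255              # "threshold" as binarization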
4.2 Parameters
The text lines were normalized to a height of 32 in the preprocessing step. Both the left-to-right and the right-to-left LSTM layers contain 100 LSTM memory blocks. The learning rate was set to 1e-4 and the momentum to 0.9. Training was carried out for one million steps (roughly corresponding to 100 epochs, given the size of the training set). Training errors were reported every 10,000 training steps and plotted. The network corresponding to the minimum training error was used for the test-set evaluation.
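For concreteness, an equivalent configuration in a modern framework might look as follows. This is a PyTorch re-implementation sketch under our assumptions, not the library actually used in the experiments.

# Bidirectional 1D LSTM line recognizer, matching the stated parameters:
# 32-pixel input columns, 100 memory blocks per direction, CTC output.
import torch
import torch.nn as nn

class LineRecognizer(nn.Module):
    def __init__(self, num_classes: int, height: int = 32, hidden: int = 100):
        super().__init__()
        # Each pixel column of the normalized line image is one time step.
        self.lstm = nn.LSTM(input_size=height, hidden_size=hidden,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, num_classes)  # classes include the CTC blank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, width, height) raw pixel values, no hand-crafted features
        y, _ = self.lstm(x)
        log_probs = self.proj(y).log_softmax(dim=-1)
        return log_probs.permute(1, 0, 2)  # (T, N, C), as expected by nn.CTCLoss

# Training settings from the text, e.g.:
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)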
4.3 Results
Since there are no umlaut (German) or accented (French) letters in English, when testing the LSTM model trained for German on French, and the model trained for English on French and German, words containing those special characters were omitted from the recognition results. The reason for doing this was to be able to correctly gauge the effect of not using language models. If those words were not removed, the resulting error would also contain a proportion of errors caused by characters that the network had never seen in training. By removing the words with special characters, the true performance of an LSTM network trained for a language with a smaller alphabet on a language with a larger alphabet can be evaluated. It should be noted that these results were obtained without the aid of any post-processing step, such as language modelling or the use of dictionaries to correct OCR errors.
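The word filter just described amounts to the following check (a sketch; the names are ours):

# Drop ground-truth words containing characters outside the model's
# training alphabet (e.g. umlauts/accents for an English-only model).
def filter_words(words: list[str], model_alphabet: set[str]) -> list[str]:
    return [w for w in words if all(ch in model_alphabet for ch in w)]

# Example: an English-only model evaluated on German text.
english_alphabet = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ .,")
print(filter_words(["über", "Haus", "schön", "Tag"], english_alphabet))
# -> ['Haus', 'Tag']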
The LSTM model trained on mixed data was able to obtain similar recognition results (around 1% recognition error) when applied to English, German and French individually. The other results indicate a small language dependence, in that LSTM models trained for a single language yielded lower error rates when tested on the same script than when evaluated on other scripts.
To gauge the magnitude of the effect of language modelling, we compared our results with the Tesseract open-source OCR system [18]. We applied the latest available models (as of submission date) for English, French and German to the same test data. The Tesseract system achieved higher error rates than the LSTM-based models. Tesseract's model for English yielded 1.33%, 5.02%, 5.09% and 4.82% recognition error when applied to English, French, German and mixed data respectively. The model for French yielded 2.06%, 2.7%, 3.5% and 2.96% recognition error when applied to English, French, German and mixed data respectively, while the model for German yielded 1.85%, 2.9%, 6.63% and 4.36% recognition error when applied to English, French, German and mixed data respectively. These results show that the absence of language modelling, or the application of a different language's model, affects recognition. Since no Tesseract model for mixed data is available, the effect of evaluating such a model on the individual scripts could not be computed.
5. DISCUSSION AND CONCLUSIONS
The results presented in this paper show that LSTM networks can be used for multilingual OCR. LSTM networks do not learn a particular language model internally (nor do we need such models as a post-processing step). They show great promise in learning the various shapes of a given character in different fonts and under degradations (as is evident from our highly versatile data).
Table 3: Sample outputs from the four LSTM networks trained for English, German, French and mixed data. An LSTM network trained on a specific language is unable to recognize the special characters of other languages, as they were not part of its training; it is therefore necessary to exclude these errors from the final error score. Thus we can train an LSTM model on mixed data from a script family and use it to recognize the individual languages of that family with very low recognition error.
[Table body not reproduced: each row shows a text-line image followed by the recognized output of the English, German, French and mixed-data models.]
The language dependence is observable, but the effects are small compared to other state-of-the-art OCR systems, where the absence of a language model results in relatively poor performance. To gauge the language dependence more precisely, one could evaluate LSTM performance by training LSTM networks on randomly generated data following the n-gram statistics of a language and testing those models on natural languages. We are currently working in this direction, and the results will be reported elsewhere.
In the following, we analyse the errors made by our LSTM networks when applied to other scripts. The top 5 confusions for each case are tabulated in Table 4. The case of applying an LSTM network to the same language for which it was trained is not discussed here, as it is not relevant to the cross-language performance of LSTM networks.
Most of the errors caused by the LSTM network trained on mixed data are non-recognition (deletion) of certain characters like l, t, r and i. These errors may be removed by better training.
Looking at the first column of Table 4 (applying the LSTM network trained for English to the other 3 scripts), most of the errors are due to confusion between characters of similar shapes, like I and l (and vice versa), Z and 2, and c and e. Two confusions, namely Z with A and Z with L, are interesting, as there is apparently no shape similarity between them. One possible explanation for this behaviour is that Z is the least frequent letter in English2, so there may not be many Zs in the training samples, resulting in its poor recognition. Two other noticeable errors (also in other models) are the unrecognized space and ∅ (denoting that the letter was deleted).
2http://en.wikipedia.org/wiki/Letter_frequency
For the LSTM network trained on German (second column), most of the top errors are due to the inability to recognize a particular letter. The top errors when applying the LSTM network trained for French to other scripts are confusions of w/W with v/V. An interesting observation, which could be a possible reason for this behaviour, is that the relative frequency of w (see footnote) is very low in French. In other words, 'w' may be considered a special character with respect to French when applying the French model to German and English. So this is a language-dependent issue, which is not observable in the case of mixed data.
This work can be extended in many directions. First, more European languages such as Italian, Spanish and Dutch may be included in the current set-up to train an all-in-one LSTM network for these languages. Secondly, other script families, especially Nabataean and Indic scripts, can be tested to further validate our hypothesis empirically.
6. REFERENCES
[1] A. C. Popat, "Multilingual OCR Challenges in Google Books," 2012. [Online]. Available: http://dri.ie/sites/default/files/files/popat_multilingual_ocr_challenges-handout.pdf
[2] R. Smith, D. Antonova, and D. S. Lee, "Adapting the Tesseract Open Source OCR Engine for Multilingual OCR," in Int. Workshop on Multilingual OCR, Jul. 2009.
[3] M. A. Obaida, M. J. Hossain, M. Begum, and M. S. Alam, "Multilingual OCR (MOCR): An Approach to Classify Words to Languages," Int'l Journal of Computer Applications, vol. 32, no. 1, pp. 46-53, Oct. 2011.
Table 4: Top confusions when applying the LSTM models to the various tasks. The confusions of an LSTM model on the language for which it was trained are not listed, as they are not needed for the present discussion. ∅ denotes the garbage class, i.e. the character was not recognized at all. When the LSTM network trained on English was applied to the other scripts, the resulting top errors are similar: shape confusions between characters. Non-recognition of "space" and "∅" are other noticeable errors. For the network trained on German, most errors are due to deletion of characters. Confusions of w/W with v/V are the top confusions when the LSTM network trained on French was applied to other scripts.
Script \ Model | English                           | German                          | French                          | Mixed
---------------+-----------------------------------+---------------------------------+---------------------------------+---------------------------------
English        | -                                 | space→∅, c→∅, t→∅, 0→∅          | y→v, w→v, w→vv, space→∅, w→∅    | I→l, space→∅, t→∅, 0→∅, l→∅
German         | I→l, Z→L, Z→A, e→c, Z→2           | -                               | w→v, ü→û, W→V, space→∅, w→vv    | space→∅, t→∅, l→∅, i→∅, r→∅
French         | 0→∅, space→∅, l→I, l→t, !→I       | space→∅, 0→∅, e→∅, c→∅, l→∅     | -                               | space→∅, i→∅, é→e, l→∅, 0→∅
Mixed-script   | 0→∅, I→l, l→I, space→∅, l→t       | space→∅, 0→∅, q→g, e→∅, l→T     | w→v, ö→ô, ä→â, W→V, ü→û         | -
[4] P. Natarajan, Z. Lu, R. M. Schwartz, I. Bazzi, and J. Makhoul, "Multilingual Machine Printed OCR," IJPRAI, vol. 15, no. 1, pp. 43-63, 2001.
[5] T. M. Breuel, A. Ul-Hasan, M. A. Al-Azawi, and F. Shafait, "High Performance OCR for English and Fraktur using LSTM Networks," in Int. Conf. on Document Analysis and Recognition, Aug. 2013.
[6] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[7] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber, "A Novel Connectionist System for Unconstrained Handwriting Recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 855-868, May 2008.
[8] J. L. Elman, "Finding Structure in Time," Cognitive Science, vol. 14, no. 2, pp. 179-211, 1990.
[9] H. Jaeger, "Tutorial on Training Recurrent Neural Networks, Covering BPTT, RTRL, EKF and the 'Echo State Network' Approach," Sankt Augustin, Tech. Rep., 2002.
[10] A. W. Senior, "Off-line Cursive Handwriting Recognition using Recurrent Neural Networks," Ph.D. dissertation, England, 1994.
[11] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, "Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies," in A Field Guide to Dynamical Recurrent Neural Networks, S. C. Kremer and J. F. Kolen, Eds. IEEE Press, 2001.
[12] Y. Bengio, P. Simard, and P. Frasconi, "Learning Long-Term Dependencies with Gradient Descent is Difficult," IEEE Trans. on Neural Networks, vol. 5, no. 2, pp. 157-166, Mar. 1994.
[13] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks," in ICML, Pennsylvania, USA, 2006, pp. 369-376.
[14] A. Graves, "RNNLIB: A Recurrent Neural Network Library for Sequence Learning Problems." [Online]. Available: http://sourceforge.net/projects/rnnl
[15] "OCRopus - Open Source Document Analysis and OCR System." [Online]. Available: https://code.google.com/p/ocropus
[16] T. M. Breuel, "The OCRopus Open Source OCR System," in DRR XV, vol. 6815, Jan. 2008, p. 68150F.
[17] H. S. Baird, "Document Image Defect Models," in Structured Document Image Analysis, H. S. Baird, H. Bunke, and K. Yamamoto, Eds. New York: Springer-Verlag, 1992.
[18] R. Smith, "An Overview of the Tesseract OCR Engine," in ICDAR, 2007, pp. 629-633.