Can we build language-independent OCR
using LSTM networks?
Adnan Ul-Hasan
Technical University of Kaiserslautern
67663 Kaiserslautern, Germany
adnan@cs.uni-kl.de
Thomas M. Breuel
Technical University of Kaiserslautern
67663 Kaiserslautern, Germany
tmb@cs.uni-kl.de
ABSTRACT
Language models or recognition dictionaries are usually considered an essential step in OCR. However, using a language model complicates the training of OCR systems, and it also narrows the range of texts to which an OCR system can be applied. Recent results have shown that Long Short-Term Memory (LSTM) based OCR yields low error rates even without language modeling. In this paper, we explore to what extent LSTM models can be used for multilingual OCR without language models. To do this, we measure the cross-language performance of LSTM models trained on different languages. LSTM models show good promise for language-independent OCR: the recognition errors are very low (around 1%) without any language model or dictionary correction.
Keywords
MOCR, LSTM Networks, RNN
1. INTRODUCTION
Multilingual OCR (MOCR) is of interest for many reasons: digitizing historical books containing two or more scripts, bilingual books, dictionaries, and books with line-by-line translation are a few reasons to want reliable multilingual OCR systems. However, MOCR also presents several unique challenges, as Popat pointed out in the context of the Google Books project1. Some of these challenges are:
• Multiple scripts/languages on a page (multi-script identification).
• Multiple languages in the same or similar fonts, like Arabic-Persian or English-German.
• The same language in multiple scripts, like Urdu in Nastaleeq and Naskh scripts.
• Archaic and reformed orthographies, e.g. 18th-century English, Fraktur (historical German), etc.
1 http://en.wikipedia.org/wiki/Google_Books
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Request permissions from Permissions@acm.org.
MOCR ’13, August 24 2013, Washington, DC, USA
Copyright 2013 ACM 978-1-4503-2114-3/13/08 ...$15.00.
http://dx.doi.org/10.1145/2505377.2505394
There have been efforts to adapt existing OCR systems to other languages. The open-source OCR system Tesseract [2] is one such example. Its basic classification is based on hierarchical shape classification, where the character set is first reduced to a few candidates and, at the last stage, the test sample is matched against the representatives of this short list. Although Tesseract can be used for a variety of languages (due to the support available for many languages), it cannot be used as an all-in-one solution in situations where multiple scripts occur together.
The usual approach to the multilingual OCR problem is to combine two or more separate classifiers [3], as it is believed that reasonable OCR output even for a single script cannot be obtained without sophisticated post-processing steps such as language modelling, use of a dictionary to correct OCR errors, font adaptation, etc. Natarajan et al. [4] proposed an HMM-based script-independent multilingual OCR system in which the feature extraction, training and recognition components are all language-independent; however, they use language-specific word lexica and language models for recognition. To the best of our knowledge, no OCR method has been proposed that achieves very low error rates without the aforementioned sophisticated post-processing techniques. Recent experiments on English and German text using LSTM networks [5], however, have shown that reliable OCR results can be obtained without such techniques.
Our hypothesis for multilingual OCR is that if a single model can be obtained, at least for a family of scripts (e.g. Latin, Arabic, Indic), it can then be used to recognize all scripts of that family, thereby avoiding the effort of combining multiple classifiers. Since LSTM networks achieve very low error rates without a language-modelling post-processing step, they are a natural candidate for multilingual OCR.
In this paper, we report the results of applying LSTM networks to the multilingual OCR problem. The basic aim is to benchmark the extent to which LSTM networks rely on language modelling to predict the correct labels, and whether we can do without language modelling and other post-processing steps. Additionally, we want to see how well LSTM networks use context to recognize a particular character. Specifically, we trained LSTM networks on English, German, French and a mixed set of these three languages, and tested each network on each language. The LSTM-based models achieve very high recognition accuracy without the aid of language modelling, and they show good promise for multilingual OCR tasks.
Figure 1: Some sample images from our database. There are 96 variations of fonts in common use; e.g. for Times true-type fonts, the normal, italic, bold and bold-italic variations were included. Also note that these images were degraded to reflect scanning artefacts.
The paper is organized as follows: the preprocessing step is described in the next section, Section 3 describes the configuration of the LSTM networks used in our experiments, and Section 4 gives the details of the experimental evaluation. Section 5 concludes the paper with a discussion of the current work and future directions.
2. PREPROCESSING
Scale and the relative position of a character are important features for distinguishing characters in Latin script (and some other scripts), so text-line normalization is an essential step in applying 1D LSTM networks to OCR. In this work, we used the normalization approach introduced in [5], namely text-line normalization based on a trainable, shape-based model. A token dictionary, created from a collection of text lines, contains information about the x-height, baseline (geometric features) and shape of individual characters. These models are then used to normalize any text-line image.
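The trainable, shape-based normalization of [5] is more involved than plain rescaling, but its geometric effect can be sketched with a simple fixed-height rescaler (the target height of 32 matches the parameters in Section 4.2; the function name and the nearest-neighbour scheme are illustrative choices, not the actual implementation):

```python
def normalize_height(image, target_h=32):
    # image: list of rows (each a list of pixel values).
    # Nearest-neighbour rescaling to a fixed height, preserving aspect ratio.
    # The trainable, shape-based model of [5] additionally centres the
    # baseline and x-height; this sketch only fixes the vertical scale.
    h = len(image)
    w = len(image[0])
    scale = target_h / float(h)
    new_w = max(1, int(round(w * scale)))
    out = []
    for y in range(target_h):
        src_y = min(h - 1, int(y / scale))
        row = []
        for x in range(new_w):
            src_x = min(w - 1, int(x / scale))
            row.append(image[src_y][src_x])
        out.append(row)
    return out

line = [[0] * 100 for _ in range(48)]  # a 48-pixel-high dummy text line
norm = normalize_height(line)
print(len(norm), len(norm[0]))  # -> 32 67
```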
3. LSTM NETWORKS
Recurrent Neural Networks (RNNs) have shown great promise in recent times due to the Long Short-Term Memory (LSTM) architecture [6], [7]. The LSTM architecture differs significantly from earlier architectures such as Elman networks [8] and echo-state networks [9], and appears to overcome many of the limitations and problems of those earlier designs.
Traditional RNNs, though good at context-aware processing [10], have not shown competitive performance on OCR and speech recognition tasks. Their shortcomings are attributed mainly to the vanishing-gradient problem [11, 12]. The Long Short-Term Memory [6] architecture was designed to overcome this problem: it is a highly non-linear recurrent network with multiplicative "gates" and additive feedback. Graves et al. [7] introduced the bidirectional LSTM architecture for accessing context in both the forward and backward directions; both layers are connected to a single output layer. To avoid the need for segmented training data, Graves et al. [13] used a forward-backward algorithm to align transcripts with the output of the neural network. The interested reader is referred to the above-mentioned references for further details regarding LSTM and RNN architectures.
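As a concrete illustration of the "multiplicative gates and additive feedback" mentioned above, the following is a minimal single-cell LSTM step in plain Python (the scalar weights are arbitrary example values; real networks use weight matrices and many cells):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    # One step of a single LSTM memory cell (scalar weights for clarity).
    # The gates are multiplicative; the cell-state update is additive,
    # which is what lets gradients survive over long input sequences.
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate value
    c = f * c_prev + i * g   # additive feedback through the cell state
    h = o * math.tanh(c)     # gated output
    return h, c

w = {k: 0.5 for k in ("wi", "ui", "bi", "wf", "uf", "bf",
                      "wo", "uo", "bo", "wg", "ug", "bg")}
h, c = 0.0, 0.0
for x in (1.0, 0.5, -0.5):   # a toy 1D input sequence, like line-image columns
    h, c = lstm_step(x, h, c, w)
print(round(h, 4))
```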
For recognition, we used a 1D bidirectional LSTM architecture, as described in [7]. We found that the 1D architecture outperforms its 2D or higher-dimensional siblings for printed OCR tasks. For all experiments reported in this paper, we used a modified version of the LSTM library described in [14]. That library provides 1D and multidimensional LSTM networks, together with ground-truth alignment using a forward-backward algorithm ("CTC", connectionist temporal classification [13]). The library also provides a heuristic decoding mechanism to map the frame-wise network output onto a sequence of symbols. We have reimplemented LSTM networks and forward-backward alignment from scratch and reproduced these results (our implementation uses a slightly different decoding mechanism). This implementation has been released in open-source form [15] (ocropus version 0.7).
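The decoding step that maps frame-wise outputs to a symbol sequence can be illustrated by the standard greedy "best path" scheme: take the most likely label per frame, collapse consecutive repeats, and drop the blank label. (The library's actual heuristic differs in its details; the alphabet and scores below are toy values.)

```python
def ctc_greedy_decode(frames, alphabet, blank=0):
    # frames: per-frame score vectors from the network's output layer.
    # Greedy ("best path") decoding: argmax label per frame, collapse
    # consecutive repeats, then drop the blank label.
    best = [max(range(len(f)), key=f.__getitem__) for f in frames]
    out = []
    prev = None
    for label in best:
        if label != prev and label != blank:
            out.append(alphabet[label])
        prev = label
    return "".join(out)

# Toy output over the alphabet {blank, 'c', 'a', 't'}.
frames = [
    [0.1, 0.8, 0.05, 0.05],   # 'c'
    [0.1, 0.8, 0.05, 0.05],   # 'c' (repeat, collapsed)
    [0.9, 0.03, 0.03, 0.04],  # blank separates repeated labels
    [0.1, 0.1, 0.7, 0.1],     # 'a'
    [0.1, 0.1, 0.1, 0.7],     # 't'
]
print(ctc_greedy_decode(frames, [None, "c", "a", "t"]))  # -> cat
```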
During training, randomly chosen input text-line images are presented as 1D sequences to the forward-propagation step through the LSTM cells, and the forward-backward alignment of the output is then performed. Errors are back-propagated to update the weights, and the process is repeated for the next randomly selected text-line image. Note that raw pixel values are used as the only features; no other sophisticated features were extracted from the text-line images. The features implicit in the 1D sequence are the baselines and x-heights of the individual characters.
4. EXPERIMENTAL EVALUATION
The aim of our experiments was to evaluate LSTM performance on multilingual OCR without the aid of language modelling or other language-specific assistance. To explore the cross-language performance of LSTM networks, a number of experiments were performed: we trained four separate LSTM networks, on English, German, French and a mixed set of all three languages. For testing, we have a total of 16 permutations, since each LSTM network was tested on
Table 1: Statistics on the number of text-line images in the English, French, German and mixed-script datasets.

Language       Total     Training   Test
English        85,350    81,600     4,750
French         85,350    81,600     4,750
German         114,749   110,400    4,349
Mixed-script   85,350    81,600     4,750
Table 2: Results of applying LSTM networks to multilingual OCR (character error rates). These results support our hypothesis that a single LSTM model trained on a mixture of scripts (from a single script family) can be used to recognize the text of the individual family members. Note that the error rates for the network trained on German tested on French, and for the network trained on English tested on French and German, were obtained by ignoring words containing special characters (umlauts and accented letters), to correctly gauge the effect of the language models of a particular language. LSTM networks trained on individual languages can also be used to recognize other scripts, but they show some language dependence. All these results were achieved without the aid of any language model.
Script \ Model   English (%)  German (%)  French (%)  Mixed (%)
English             0.5          1.22        4.1         1.06
German              2.04         0.85        4.7         1.2
French              1.8          1.4         1.1         1.05
Mixed-script        1.7          1.1         2.9         1.1
the respective script and on the other three scripts, e.g. the LSTM network trained on German was tested on German, French, English and mixed-script data. These results are summarized in Table 2, and some sample outputs are presented in Table 3. As the error measure, we used the ratio of insertions, deletions and substitutions relative to the ground truth; accuracy was measured at the character level.
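This error measure is the character error rate, computed from the Levenshtein distance between the network output and the ground-truth transcription. A minimal sketch (the function names are our own):

```python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance: counts the
    # insertions, deletions and substitutions needed to turn hyp into ref.
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def character_error_rate(ref, hyp):
    # Error rate at the character level, relative to ground-truth length.
    return edit_distance(ref, hyp) / float(len(ref))

# One substitution in an 11-character line:
print(round(character_error_rate("hello world", "hell0 world"), 4))  # -> 0.0909
```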
4.1 Database
A separate synthetic database for each language was developed using OCRopus [16] (ocropus-linegen). This utility requires a set of UTF-8 encoded text files and a set of true-type fonts; with these two inputs, one can artificially generate any number of text-line images. The utility also provides control over scanning artefacts such as distortion, jitter and other degradations. Separate corpora of text-line images in German, English and French were generated in commonly used fonts (including bold, italic and bold-italic variants) from freely available literature. These images were degraded using the degradation models of [17] to reflect scanning artefacts; there are four degradation parameters, namely elastic elongation, jitter, sensitivity and threshold. Sample text-line images from our database are shown in Figure 1. Each database is further divided into training and test sets; statistics on the number of text-line images for each of the four scripts are given in Table 1.
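As a rough illustration of such degradations, the following sketch applies per-row jitter, per-pixel sensitivity noise and a binarization threshold to a synthetic line image. Elastic elongation is omitted, and while the parameter names follow the paper, the formulas are simplified stand-ins for the actual models of [17]:

```python
import random

def degrade(image, jitter=1, sensitivity=0.05, threshold=0.5, seed=0):
    # Simplified stand-in for Baird-style degradation models [17]:
    # per-row horizontal jitter, per-pixel sensitivity (sensor) noise,
    # and a final binarization threshold.
    rng = random.Random(seed)
    out = []
    for row in image:
        shift = rng.randint(-jitter, jitter)               # jitter: small offsets
        shifted = row[-shift:] + row[:-shift] if shift else row[:]
        noisy = [p + rng.gauss(0.0, sensitivity) for p in shifted]  # sensor noise
        out.append([1 if p > threshold else 0 for p in noisy])      # threshold
    return out

# A clean synthetic "stroke": a black bar on a white line image.
clean = [[1 if 20 <= x < 80 else 0 for x in range(100)] for _ in range(32)]
dirty = degrade(clean)
print(len(dirty), len(dirty[0]))  # -> 32 100
```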
4.2 Parameters
The text lines were normalized to a height of 32 pixels in the preprocessing step. Both the left-to-right and the right-to-left LSTM layers contain 100 LSTM memory blocks. The learning rate was set to 1e-4 and the momentum to 0.9. Training was carried out for one million steps (roughly corresponding to 100 epochs, given the size of the training set). Training errors were reported every 10,000 training steps and plotted, and the network corresponding to the minimum training error was used for the test-set evaluation.
4.3 Results
Since English has no umlauts (German) or accented letters (French), words containing such special characters were omitted from the recognition results when testing the LSTM model trained on German on French, and the model trained on English on French and German. The reason was to be able to correctly gauge the effect of not using language models: had those words not been removed, the resulting error would also contain a proportion of errors caused by characters that were absent from the training alphabet. By removing the words with special characters, the true performance of an LSTM network trained on a language with a smaller alphabet can be evaluated on a language with a larger alphabet. It should be noted that these results were obtained without the aid of any post-processing step, such as language modelling or the use of dictionaries to correct OCR errors.
The LSTM model trained on mixed data obtained similar recognition results (around 1% recognition error) when applied to English, German and French individually. The other results indicate a small language dependence, in that LSTM models trained on a single language yielded lower error rates when tested on the same script than when evaluated on other scripts.
To gauge the magnitude of the effect of language modelling, we compared our results with the Tesseract open-source OCR system [18]. We applied the latest available models (as of the submission date) for English, French and German to the same test data. The Tesseract system achieved higher error rates than the LSTM-based models. Tesseract's model for English yielded 1.33%, 5.02%, 5.09% and 4.82% recognition error when applied to English, French, German and mixed data respectively. The model for French yielded 2.06%, 2.7%, 3.5% and 2.96% recognition error when applied to English, French, German and mixed data respectively, while the model for German yielded 1.85%, 2.9%, 6.63% and 4.36% recognition error on the same four test sets. These results show that the absence of language modelling, or the application of a mismatched language model, affects recognition. Since no Tesseract model for mixed data is available, the effect of evaluating such a model on the individual scripts could not be computed.
5. DISCUSSION AND CONCLUSIONS
The results presented in this paper show that LSTM networks can be used for multilingual OCR. LSTM networks do not learn a particular language model internally (nor do we need such a model as a post-processing step). They show great promise in learning the various shapes of a character across different fonts and degradations (as evident from our highly versatile data).

Table 3: Sample outputs from the four LSTM networks trained on English, German, French and mixed data. An LSTM network trained on a specific language is unable to recognize the special characters of other languages, as they were not part of its training; it is therefore necessary to exclude these errors from the final error score. Thus we can train an LSTM model on the mixed data of a script family and use it to recognize the individual languages of this family with very low recognition error. [The table shows three sample text-line images, each followed by the recognition output of the English, German, French and mixed-data models.]

The language dependence is observable, but its effects are small compared to other state-of-the-art OCR systems, where the absence of a language model leads to markedly worse results. To gauge the language dependence more precisely, one could train LSTM networks on data generated randomly from n-gram statistics and test those models on natural languages. We are currently working in this direction, and the results will be reported elsewhere.
In the following, we analyse the errors made by our LSTM networks when applied to other scripts. The top 5 confusions for each case are tabulated in Table 4. The case of applying an LSTM network to the language on which it was trained is not discussed here, as it is not relevant to the cross-language performance of LSTM networks.
Most of the errors made by the LSTM network trained on mixed data are non-recognitions (deletions) of certain characters such as l, t, r and i. These errors may be reduced by better training.
Looking at the first column of Table 4 (applying the LSTM network trained on English to the other three scripts), most errors are due to confusions between characters of similar shape, such as I with l (and vice versa), Z with 2, and c with e. Two confusions, namely Z with A and Z with L, are interesting because there is apparently no shape similarity between them. One possible explanation is that Z is the least frequent letter in English2, so there may be too few Zs in the training samples, resulting in its poor recognition. Two other noticeable errors (also present in the other models) are the unrecognised space and ' (i.e. these characters were deleted).
2 http://en.wikipedia.org/wiki/Letter_frequency
For the LSTM network trained on German (second column), most of the top errors are due to the inability to recognize a particular letter. The top errors when applying the LSTM network trained on French to other scripts are confusions of w/W with v/V. An interesting observation, and a possible explanation of this behaviour, is that the relative frequency of w is very low in French (see footnote); in other words, 'w' can be considered a special character with respect to French. This is thus a language-dependent issue, arising when the French model is applied to German and English, that is not observable with the mixed-data model.
This work can be extended in several directions. First, more European languages, such as Italian, Spanish and Dutch, may be included in the current set-up to train an all-in-one LSTM network for these languages. Secondly, other script families, especially Nabataean and Indic scripts, can be tested to further validate our hypothesis empirically.
6. REFERENCES
[1] A. C. Popat, "Multilingual OCR Challenges in Google Books," 2012. [Online]. Available: http://dri.ie/sites/default/files/files/popat_multilingual_ocr_challenges-handout.pdf
[2] R. Smith, D. Antonova, and D. S. Lee, “Adapting the
Tesseract Open Source OCR Engine for Multilingual
OCR,” in Int. Workshop on Multilingual OCR, Jul.
2009.
[3] M. A. Obaida, M. J. Hossain, M. Begum, and M. S.
Alam, “Multilingual OCR (MOCR): An Approach to
Classify Words to Languages,” Int’l Journal of
Computer Applications, vol. 32, no. 1, pp. 46–53, Oct.
2011.
Table 4: Top confusions when applying the LSTM models across languages. The confusions of an LSTM model on the language for which it was trained are not listed, as they are not relevant to the cross-language analysis. An empty left-hand side of "←" denotes the garbage class, i.e. the character was not recognized at all. When the LSTM network trained on English was applied to other scripts, the resulting top errors are similar: shape confusions between characters. Non-recognition of "space" and "'" are other noticeable errors. For the network trained on German, most errors are due to deletion of characters. Confusions of w/W with v/V are the top confusions when the LSTM network trained on French was applied to other scripts.
Script \ Model   English                     German                      French                       Mixed
English          -                           ←space, ←c, ←t, ←0, v←y     v←w, vv←w, ←space, ←w, l←I   ←space, ←t, ←0, l←I, ←l
German           l←I, L←Z, A←Z, c←e, 2←Z     -                           v←w, û←ü, V←W, ←space, vv←w  ←space, ←t, ←l, ←i, ←r
French           ←0, ←space, I←l, t←l, I←!   ←space, ←0, e←, ←c, ←l      -                            ←space, ←i, e←é, ←l, ←0
Mixed-script     ←0, l←I, I←l, ←space, t←l   ←space, ←0, g←q, e←, T←l0   v←w, ô←ö, â←ä, V←W, û←ü      -
[4] P. Natarajan, Z. Lu, R. M. Schwartz, I. Bazzi, and
J. Makhoul, “Multilingual Machine Printed OCR,”
IJPRAI, vol. 15, no. 1, pp. 43–63, 2001.
[5] T. M. Breuel, A. Ul-Hasan, M. A. Al-Azawi, and
F. Shafait, “High Performance OCR for English and
Fraktur using LSTM Networks,” in Int. Conf. on
Document Analysis and Recognition, Aug. 2013.
[6] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[7] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber, "A Novel Connectionist System for Unconstrained Handwriting Recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 855–868, May 2009.
[8] J. L. Elman, “Finding Structure in Time.” Cognitive
Science, vol. 14, no. 2, pp. 179–211, 1990.
[9] H. Jaeger, “Tutorial on Training Recurrent Neural
Networks, Covering BPTT, RTRL, EKF and the
‘Echo State Network’ approach,” Sankt Augustin,
Tech. Rep., 2002.
[10] A. W. Senior, “Off-line Cursive Handwriting
Recognition using Recurrent Neural Networks,” Ph.D.
dissertation, England, 1994.
[11] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies," in A Field Guide to Dynamical Recurrent Neural Networks, S. C. Kremer and J. F. Kolen, Eds. IEEE Press, 2001.
[12] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. on Neural Networks, vol. 5, no. 2, pp. 157–166, Mar. 1994.
[13] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks," in ICML, Pennsylvania, USA, 2006, pp. 369–376.
[14] A. Graves, “RNNLIB: A recurrent neural network
library for sequence learning problems.” [Online].
Available: http://sourceforge.net/projects/rnnl
[15] “OCRopus - Open Source Document Analysis and
OCR system.” [Online]. Available:
https://code.google.com/p/ocropus
[16] T. M. Breuel, “The OCRopus open source OCR
system,” in DRR XV, vol. 6815, Jan. 2008, p. 68150F.
[17] H. S. Baird, "Document Image Defect Models," in Structured Document Image Analysis, H. S. Baird, H. Bunke, and K. Yamamoto, Eds. New York: Springer-Verlag, 1992.
[18] R. Smith, “An Overview of the Tesseract OCR
Engine,” in ICDAR, 2007, pp. 629–633.