Printed Optical Character Recognition (OCR) is considered a “solved” problem by
many Pattern Recognition (PR) researchers. This notion, though partially true, does
not represent the whole picture. Although state-of-the-art OCR systems exist for many
scripts, for example, Latin, Greek, Han (Chinese), and Kana (Japanese), exhaustive
research is still needed for many other challenging modern scripts. Examples of such
scripts are the cursive Nabataean family, which includes Arabic, Persian, and Urdu,
and the Brahmic family of scripts, which contains Devanagari, Sanskrit, and its
derivatives. These scripts present many challenging issues for OCR, for example,
changes in the shape of a character depending on its location within a word, kerning,
and a huge number of ligatures. Moreover, OCR research for historical documents still
requires much probing; therefore, efforts are required to develop robust OCR systems
to preserve the literary heritage.
Likewise, there is a need to address the OCR of multilingual documents. Plenty of
multilingual documents exist in the current era of globalization, which has increased
the influence of different languages on each other. The growing use of foreign words
and phrases in articles, newspapers, and books is generating a large body of
multilingual literature every day. Another effect is seen in the products we use in
our daily lives. From packaging of imported food items to sophisticated electronics,
the demand of international customers to access information about these products in
their native language is ever increasing. The use of multilingual operational manuals,
books, and dictionaries motivates the need for multilingual OCR systems for their
digitization.
The aim of this thesis is to address some of these challenges using contemporary
machine learning methods, especially Recurrent Neural Networks (RNNs). Specifically,
a recent architecture of these networks, referred to as Long Short-Term Memory (LSTM)
networks, has been employed to OCR modern as well as historical documents. The
excellent OCR results obtained on these documents encourage us to extend their
application to the field of multilingual OCR.
The LSTM networks are first evaluated on standard English datasets to benchmark
their performance. They yield better recognition results than any other contemporary
OCR technique, without using sophisticated features or language modeling. Therefore,
their application is further extended to more complex scripts, including Urdu
Nastaleeq and Devanagari. For Urdu Nastaleeq script, LSTM networks achieve
the best reported OCR results (2.55% Character Error Rate (CER)) on a publicly
available dataset, while for Devanagari script, a new freely available database has
been introduced on which a CER of 9% is achieved.
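The CER figures quoted here are not accompanied by a formula in this summary. A common
definition, assumed in the minimal sketch below, is the Levenshtein (edit) distance
between the recognized text and the ground-truth transcription, divided by the length
of the ground truth. The function names are illustrative and are not taken from the
thesis.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn `hyp` into `ref`."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER under the assumed definition: edit distance / reference length."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return levenshtein(ref, hyp) / len(ref)
```

Under this definition, one wrong character in a four-character line gives a CER of
0.25, and a CER above 1.0 is possible when the recognizer emits many spurious
characters.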
The LSTM-based methodology is further extended to the OCR of historical documents.
In this regard, this thesis focuses on Old German Fraktur script, medieval Latin
script of the 15th century, and Polytonic Greek script. LSTM-based systems outperform
the contemporary OCR systems on all of these scripts. For old documents, it is usually
very hard to prepare a transcribed dataset for training a neural network in the
supervised learning paradigm. A novel methodology has been proposed that combines
segmentation-based and segmentation-free approaches to OCR scripts for which no
transcribed training data is available. For German Fraktur and Polytonic Greek scripts,
artificially generated data from existing text corpora yield highly promising results
(CER of <1% and 5% for Fraktur and Polytonic Greek scripts, respectively).
Another major contribution of this thesis is an efficient Multilingual OCR (MOCR)
system, which has been achieved in two ways. Firstly, a sequence-learning based script
identification system has been developed that works at the text-line level, thereby
eliminating the need to segment individual characters or words prior to actual script
identification. Secondly, a unified approach for dealing with different types of
multilingual documents has been proposed. The core motivation behind this generalized
framework is the ability of the human mind to read multilingual documents without any
explicit script identification. In this design, the LSTM networks recognize multiple
scripts simultaneously without the need to identify the individual scripts. The first
step in building this framework is the realization of a language-independent OCR
system that recognizes multilingual text in a single step, using a single LSTM model.
The language-independent approach is then extended to a script-independent OCR
framework that can recognize multiscript documents using a single OCR model. The
proposed generalized approach yields a low error rate (1.2%) on a database of
English-Greek bilingual documents.
In summary, this thesis aims to extend OCR research from modern Latin scripts to
old Latin, to Greek, and to other “underprivileged” scripts such as Devanagari and
Urdu Nastaleeq. It also provides a generalized OCR framework for dealing with
multilingual documents.