Conference Paper

OCRoRACT: A Sequence Learning OCR System Trained on Isolated Characters

... The second socio-cultural challenge is the difficulty of finding Balinese philologists to work within this project. Annotating a complex script like the Balinese script in palm leaf manuscripts requires language-specific expertise [29,51]. There are not many Balinese who can read the Balinese script well, even though the script is taught to all students in elementary school. ...
... Ground truthing the documents manually is very costly in terms of man-hours [29,51]. The manual ground truthing process could certainly be made faster by involving more people as ground truthers, provided that enough people with the required expertise are available to do this work. ...
... The automatic parts can be applied in the initial step or inserted in the middle of the ground truthing process. For example, for an OCR system with little or no transcribed data available, the OCRoRACT [51] and anyOCR [29] frameworks propose an approach that minimizes the need for a language expert to manually transcribe documents. The semi-correct Unicode ground truth for character clusters is identified by the language expert after semi-automatic text line and character segmentation. ...
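As a rough illustration of this pseudo-ground-truth idea, the sketch below (assuming pytesseract is installed and the line images are already segmented) lets Tesseract transcribe each text line and stores the output as provisional ground truth; the `train_line_ocr` call at the end is a hypothetical placeholder for the subsequent sequence-model training, not part of any real API.

```python
# Minimal sketch of the pseudo-ground-truth idea behind OCRoRACT / anyOCR:
# a baseline engine (here Tesseract via pytesseract) transcribes segmented
# text lines, and its output is stored as "semi-correct" ground truth for
# training a line-based sequence model.
from pathlib import Path

from PIL import Image
import pytesseract


def build_pseudo_ground_truth(line_dir: str, gt_dir: str) -> None:
    """Transcribe every line image with Tesseract and save the text as pseudo GT."""
    out = Path(gt_dir)
    out.mkdir(parents=True, exist_ok=True)
    for img_path in sorted(Path(line_dir).glob("*.png")):
        # --psm 7: treat the image as a single text line
        text = pytesseract.image_to_string(Image.open(img_path), config="--psm 7").strip()
        (out / f"{img_path.stem}.gt.txt").write_text(text, encoding="utf-8")


# build_pseudo_ground_truth("lines/", "pseudo_gt/")
# train_line_ocr("lines/", "pseudo_gt/")   # hypothetical LSTM training step
```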
Thesis
The collection of palm leaf manuscripts is an important part of Southeast Asian people's culture and life. Following the increase in digitization projects of heritage documents around the world, the collections of palm leaf manuscripts in Southeast Asia finally attracted the attention of researchers in document image analysis (DIA). The research work conducted for this dissertation focused on the heritage documents of the collection of palm leaf manuscripts from Indonesia, especially the palm leaf manuscripts from Bali. This dissertation contributes to DIA research on palm leaf manuscript collections. These collections offer new challenges for DIA research because they use palm leaf as the writing medium and are written in a language and script that had never been analyzed before. Motivated by the contextual situation and real condition of the palm leaf manuscript collections in Bali, this research tried to bring added value to digitized palm leaf manuscripts by developing tools to analyze, transliterate and index the content of palm leaf manuscripts. These systems aim at making palm leaf manuscripts more accessible, readable and understandable to a wider audience and to scholars and students all over the world. This research developed a DIA system for document images of palm leaf manuscripts that includes several image processing tasks, beginning with digitization of the document, ground truth construction, binarization, text line and glyph segmentation, and ending with glyph and word recognition, transliteration, and document indexing and retrieval. In this research, we created the first corpus and dataset of Balinese palm leaf manuscripts for the DIA research community. We also developed the glyph recognition system and the automatic transliteration system for the Balinese palm leaf manuscripts. This dissertation proposes a complete scheme of spatially categorized glyph recognition for the transliteration of Balinese palm leaf manuscripts. The proposed scheme consists of six tasks: text line and glyph segmentation, the glyph ordering process, detection of the spatial position for the glyph category, global and categorized glyph recognition, option selection for glyph recognition, and transliteration with a phonological rules-based machine. An implementation of knowledge representation and phonological rules for the automatic transliteration of Balinese script on palm leaf manuscripts is proposed. The adaptation of a segmentation-free LSTM-based transliteration system with a generated synthetic dataset and training schemes at two different levels (word level and text line level) is also proposed.
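The "transliteration with phonological rules-based machine" step can be pictured as ordered rewrite rules applied to a recognized glyph sequence. The sketch below is purely illustrative: the glyph names, base mappings and rules are invented placeholders, not the actual Balinese rule set used in the thesis.

```python
# Illustrative sketch of a phonological-rule-based transliteration step:
# a recognized glyph sequence is mapped to base strings and then rewritten
# by ordered rules. Glyph names and rules are invented placeholders.
BASE_MAP = {"ha": "ha", "na": "na", "ca": "ca", "taleng": "<e>", "ulu": "<i>"}

RULES = [
    ("a<i>", "i"),   # a vowel sign replaces the inherent 'a'
    ("a<e>", "e"),
]


def transliterate(glyphs: list[str]) -> str:
    text = "".join(BASE_MAP.get(g, "?") for g in glyphs)
    for pattern, replacement in RULES:
        text = text.replace(pattern, replacement)
    return text


print(transliterate(["na", "ulu", "ca"]))  # -> "nica"
```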
... As there has been a paradigm shift owing to the outstanding performance of deep learning algorithms, the use of deep networks in OCR has become popular. Multilayer perceptrons, CNNs, Recurrent Neural Networks (RNN), LSTM and its variations are extensively used for OCR [9]-[13]. Raw images of the UPTI [1] dataset are used to train a Multi-dimensional LSTM [13]. ...
... Besides deep networks, OCRopus, Tesseract and OCRoRACT are also very popular for languages such as Latin, Urdu and Devanagari [9], [18], [19]. Both segmentation-based (where ligatures are divided into characters) and segmentation-free (ligature-based) approaches are used in OCR systems for these languages. ...
Article
Full-text available
Abstract: The Urdu language uses a cursive script, which results in connected characters constituting ligatures. For identifying characters within ligatures of different scales (font sizes), a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network are used. Both network models are trained on previously extracted ligature thickness graphs, from which the models extract meta features. These thickness graphs provide consistent information across different font sizes. The LSTM and CNN are also trained on raw images to compare performance on both forms of input. For this research, two corpora, i.e. Urdu Printed Text Images (UPTI) and Centre for Language Engineering (CLE) Text Images, are used. Overall performance of the networks ranges between 90% and 99.8%. Average accuracy on meta features is 98.08%, while with raw images an average accuracy of 97.07% is achieved.
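A ligature thickness graph of the kind described above can be approximated as a per-column stroke-thickness profile of a binarized ligature image, normalized by image height so that it stays comparable across font sizes. The exact feature definition used in the paper may differ; this is only a minimal sketch.

```python
# Sketch of a per-column "thickness graph" for a binarized ligature image:
# count the foreground pixels per column (stroke thickness) and normalize
# by image height so the profile is comparable across font sizes.
import numpy as np


def thickness_graph(binary_img: np.ndarray) -> np.ndarray:
    """binary_img: 2D array with foreground pixels > 0."""
    fg = (binary_img > 0).astype(np.float32)
    column_thickness = fg.sum(axis=0)               # stroke pixels per column
    return column_thickness / binary_img.shape[0]   # scale-normalized profile


demo = np.zeros((32, 8), dtype=np.uint8)
demo[10:20, 2:5] = 1                                # a fake vertical stroke
print(thickness_graph(demo))
```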
... However, segmenting (Urdu and alike) cursive scripts into characters is a challenging task in itself. Recently, implicit segmentation using deep learning has been successfully investigated for recognition of Urdu text [23][24][25][26]. These techniques, however, require large training data and employ characters as units of recognition rather than ligatures or words. ...
... A critical analysis of the literature on Urdu OCR systems reveals that the problem has attracted significant research attention during the last 10 years. While the initial endeavors primarily focused on recognition of isolated characters [6][7][8], a number of deep learning-based robust solutions [17,[23][24][25][26] have been proposed in the recent years. These methods mainly rely on implicit segmentation of characters and report high recognition rates. ...
Article
Full-text available
This paper presents a segmentation-free optical character recognition system for the printed Urdu Nastaliq font using ligatures as units of recognition. The proposed technique relies on statistical features and employs Hidden Markov Models for classification. A total of 1525 unique high-frequency Urdu ligatures from the standard Urdu Printed Text Images (UPTI) database are considered in our study. Ligatures extracted from text lines are first split into primary (main body) and secondary (dots and diacritics) ligatures, and multiple instances of the same ligature are grouped into clusters using a sequential clustering algorithm. Hidden Markov Models are trained separately for each ligature using the examples in the respective cluster, by sliding overlapping windows right to left and extracting a set of statistical features. Given the query text, the primary and secondary ligatures are separately recognized and later associated together using a set of heuristics to recognize the complete ligature. The system, evaluated on the standard UPTI Urdu database, reported a ligature recognition rate of 92% on more than 6000 query ligatures.
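The sliding-window feature extraction described above can be sketched as follows: overlapping windows move right to left over the binarized ligature image and a small vector of statistical features is computed per window, producing the frame sequence an HMM would consume. The feature set and window parameters here are generic choices, not the ones used in the paper.

```python
# Sketch of right-to-left overlapped sliding windows over a ligature image,
# with generic statistical features per window (foreground density and
# vertical centroid). An HMM library such as hmmlearn could then be trained
# on these frame sequences; the cited work's feature set may differ.
import numpy as np


def window_features(binary_img: np.ndarray, width: int = 4, step: int = 2) -> np.ndarray:
    h, w = binary_img.shape
    frames = []
    # right-to-left, overlapping windows (step < width)
    for right in range(w, width - 1, -step):
        win = (binary_img[:, right - width:right] > 0).astype(np.float32)
        rows = np.arange(h, dtype=np.float32)
        mass = win.sum()
        density = mass / win.size
        centroid = (win.sum(axis=1) @ rows) / mass / h if mass else 0.0
        frames.append([density, centroid])
    return np.asarray(frames)                 # shape: (num_windows, num_features)


print(window_features(np.random.randint(0, 2, (32, 20))).shape)
```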
... Using pseudo ground truth: Ul-Hasan et al. (2016) propose using the output of an OCR engine (in this case, Tesseract) as pseudo ground truth on which a first model is trained. Although the objective of that work is not to estimate the quality of OCR output, the authors address the lack of available transcriptions and, with these pseudo ground truths, reach accuracies of around 95% on printed documents from the 17th century. ...
... A deep convolutional neural network has been used for handwritten Devanagari text [14]. An iterative procedure using segmentation-free OCR was able to reduce the initial character error of about 23% (obtained from segmentation-based OCR) to less than 7% in a few iterations [15]. There has been a lot of work on different scripts, but little work on multi-font text document images within the same script. ...
Chapter
Full-text available
Current research in OCR is focusing on the effect of multi-font and multi-size text on accuracy. To the best of our knowledge, no study has been carried out to study the effect of multi-font and multi-size text on the accuracy of Devanagari OCRs. The most popular Devanagari OCRs in the market today are Tesseract OCR, Indsenz OCR and eAksharayan OCR. In this research work, we have studied the effect of font styles, namely Nakula, Baloo, Dekko, Biryani and Aparajita, on these three OCRs. It has been observed that the accuracy of the Devanagari OCRs depends on the font style used in the text document images. Hence, we have proposed a multi-font Devanagari OCR (MFD_OCR), a text line recognition model using long short-term memory (LSTM) neural networks. We have created the training dataset Multi_Font_Train, which consists of text document images and their corresponding text files, with each text line in five different font styles, namely Nakula, Baloo, Dekko, Biryani and Aparajita. The test datasets are created using the text from the benchmark dataset [1] for each of the font styles mentioned above, and they are named the BMT_Nakula, BMT_Baloo, BMT_Dekko, BMT_Biryani and BMT_Aparajita test datasets. On evaluation of all OCRs, the MFD_OCR showed consistent accuracy across all these test datasets. It obtained comparatively good accuracy on the BMT_Dekko and BMT_Biryani test datasets. On performing detailed error analysis, we noticed that, compared to other Devanagari OCRs, the MFD_OCR has consistent insertion- and deletion-type errors across all test datasets for each font style. The deletion errors are negligible, ranging from 0.8 to 1.4%.
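The insertion/deletion/substitution analysis mentioned above can be reproduced generically by aligning OCR output with the reference via edit distance and counting each error type during backtracking. This is standard Levenshtein bookkeeping, not the evaluation code of the cited chapter.

```python
# Align OCR output with the reference via edit distance and count
# insertions, deletions and substitutions separately.
def error_breakdown(ref: str, hyp: str) -> dict:
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    ins = dele = sub = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            sub += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ins, j = ins + 1, j - 1
        else:
            dele, i = dele + 1, i - 1
    return {"insertions": ins, "deletions": dele, "substitutions": sub,
            "cer": (ins + dele + sub) / max(m, 1)}


# one 'a' of the reference is missing in the hypothesis -> 1 deletion
print(error_breakdown("devanagari", "devanagri"))
```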
... Handwritten text recognition: an overview. HTR is an active research area in the computational sciences, dating back to the mid-twentieth century (Dimond, 1957). HTR was originally closely aligned to the development of optical character recognition (OCR) technology, where scanned images of printed text are converted into machine-encoded text, generally by comparing individual characters with existing templates (Govindan and Shivaprasad, 1990; Schantz, 1982; Ul-Hasan et al., 2016). HTR developed into a research area in its own right due to the variability of different hands, and the computational complexity of the task (Bertolami and Bunke, 2008; Kichuk, 2015; Leedham, 1994; Sudholt and Fink, 2016). ...
Article
Purpose: An overview of the current use of handwritten text recognition (HTR) on archival manuscript material, as provided by the EU H2020 funded Transkribus platform. It explains HTR, demonstrates Transkribus, gives examples of use cases, highlights the effect HTR may have on scholarship, and evidences this turning point in the advanced use of digitised heritage content. The paper aims to discuss these issues. Design/methodology/approach: This paper adopts a case study approach, using the development and delivery of the one openly available HTR platform for manuscript material. Findings: Transkribus has demonstrated that HTR is now a usable technology that can be employed in conjunction with mass digitisation to generate accurate transcripts of archival material. Use cases are demonstrated, and a cooperative model is suggested as a way to ensure sustainability and scaling of the platform. However, funding and resourcing issues are identified. Research limitations/implications: The paper presents results from projects; further user studies could be undertaken involving interviews, surveys, etc. Practical implications: Only HTR provided via Transkribus is covered; however, this is the only publicly available platform for HTR on individual collections of historical documents at the time of writing and it represents the current state of the art in this field. Social implications: The increased access to information contained within historical texts has the potential to be transformational for both institutions and individuals. Originality/value: This is the first published overview of how HTR is used by a wide archival studies community, reporting and showcasing current application of handwriting technology in the cultural heritage sector.
... Segmentation-free OCR training has been very well detailed in Ul-Hasan et al. [17]. It replaces manual ground truth production by first training a standard OCR system (Tesseract) on a historically reconstructed typeface with subsequent OCR training (OCRopus) on the actual book using Tesseract's output as pseudo ground truth. ...
... The Kallimachos project at Würzburg University did have success with Franken+, reaching accuracies of over 95% for an incunable printing (Kirchner et al. 2016), but this method again relies on creating diplomatic transcriptions from scratch for each individual typeface. The method proposed by Ul-Hasan et al. (2016) to circumvent ground truth production, by first training Tesseract on a historically reconstructed typeface and then training OCRopus on the actual book using Tesseract's recognition as pseudo ground truth, has also achieved accuracies above 95%, but it shifts the transcription effort to the manual (re)construction of the typeface. ...
Article
Full-text available
Good OCR results on historical documents rely on diplomatic transcriptions of printed material as ground truth which is both a scarce resource and time-consuming to generate. A strategy is proposed which starts from a mixed model trained on already available transcriptions from different centuries giving accuracies over 90% on a test set from the same period of time, overcoming the typography barrier of having to train individual models separately for each historical typeface. It is shown that both mean character confidence (as output by the OCR engine OCRopus) and lexicality (a measure of correctness of OCR tokens compared to a lexicon of modern wordforms taking historical spelling patterns into account, which can be calculated for any OCR engine) correlate with true accuracy determined from a comparison of OCR results with ground truth. These measures are then used to guide the training of new individual OCR models either using OCR prediction as pseudo ground truth (fully automatic method) or choosing a minimum set of hand-corrected lines as training material (manual method). Already 40-80 hand-corrected lines lead to OCR results with character error rates of only a few percent. This procedure minimizes the amount of ground truth production and does not depend on the previous construction of a specific typographic model.
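A minimal sketch of the selection strategy described above, assuming per-line mean character confidences are already available from the OCR engine: lines are ranked by a combined quality proxy and only the lowest-scoring ones are sent for hand-correction. Lexicality is crudely approximated here as the fraction of tokens found in a modern wordform lexicon; the real measure also accounts for historical spelling patterns.

```python
# Rank OCR lines by a cheap quality proxy and hand-correct only the
# lowest-scoring ones as training material.
def lexicality(line: str, lexicon: set) -> float:
    tokens = [t.strip(".,;:!?").lower() for t in line.split()]
    tokens = [t for t in tokens if t]
    return sum(t in lexicon for t in tokens) / len(tokens) if tokens else 0.0


def pick_lines_for_correction(lines, confidences, lexicon, budget=40):
    """lines/confidences: parallel lists; returns indices of lines to hand-correct."""
    scores = [0.5 * c + 0.5 * lexicality(t, lexicon)
              for t, c in zip(lines, confidences)]
    return sorted(range(len(lines)), key=lambda i: scores[i])[:budget]
```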
Article
Historical document processing (HDP) corresponds to the task of converting the physical, bound form of historical archives into a web-based, centrally digitized form for their conservation, preservation, and ubiquitous access. Besides the conservation of these invaluable historical collections, the key agenda is to make these geographically distributed historical repositories available for information mining and retrieval in a web-centralized, touchless mode. Being a matter of interest for interdisciplinary scholars, the endeavor has garnered the attention of many researchers, resulting in an immense body of literature dedicated to digitization strategies. The present study first assembles the prevalent tasks essential for HDP into a pipeline and frames an outline for a generic workflow for historical document digitization. Then, it reports the latest task-specific state of the art, giving a brief discourse on the methods and open challenges in handling historical printed and handwritten script images. Next, grounded on various layout attributes, it further discusses the evaluation metrics and datasets available for observational and analytical purposes. The current study is an attempt to trace the contours of ongoing research and its bottlenecks, thus providing readers with a comprehensive view and understanding of existing studies and unfolding the open avenues for future work.
Preprint
Historical Document Processing is the process of digitizing written material from the past for future use by historians and other scholars. It incorporates algorithms and software tools from various subfields of computer science, including computer vision, document analysis and recognition, natural language processing, and machine learning, to convert images of ancient manuscripts, letters, diaries, and early printed texts automatically into a digital format usable in data mining and information retrieval systems. Within the past twenty years, as libraries, museums, and other cultural heritage institutions have scanned an increasing volume of their historical document archives, the need to transcribe the full text from these collections has become acute. Since Historical Document Processing encompasses multiple sub-domains of computer science, knowledge relevant to its purpose is scattered across numerous journals and conference proceedings. This paper surveys the major phases of, standard algorithms, tools, and datasets in the field of Historical Document Processing, discusses the results of a literature review, and finally suggests directions for further research.
Article
Full-text available
Urdu optical character recognition (OCR) is a complex problem due to the nature of its script, which is cursive. Recognizing characters of different font sizes further complicates the problem. In this research, a long short-term memory recurrent neural network (LSTM-RNN) and a convolutional neural network (CNN) are used to recognize Urdu optical characters of different font sizes. The LSTM-RNN is trained on previously extracted feature sets, which are extracted for scale-invariant recognition of Urdu characters. From these features, the LSTM-RNN extracts meta features. The CNN is trained on raw binary images. Two benchmark datasets, i.e. Centre for Language Engineering Text Images (CLETI) and Urdu Printed Text Images (UPTI), are used. The LSTM-RNN reveals consistent results on both datasets and outperforms the CNN. A maximum accuracy of 99% is achieved using the LSTM-RNN.
Conference Paper
Bidirectional LSTM-RNNs have become one of the standard methods for sequence learning, especially in the context of OCR, due to their ability to process unsegmented data and their inherent statistical language modeling [5]. It has recently been shown that training LSTM-RNNs even with imperfect transcriptions can lead to improved transcription results [7, 14]. The statistical nature of the LSTM's inherent language modeling can compensate for some of the errors in the ground truth and learn the correct temporal relations. In this paper we systematically explore the limits of the LSTM's language modeling ability by comparing the impact of imperfect transcriptions with various hand-crafted error types and real erroneous data created through segmentation and clustering. We show that training LSTM-RNNs with imperfect transcriptions can produce useful OCR models even if the ground truth error is up to 20%. Further, we show that they can compensate almost perfectly for some hand-crafted error types with error rates of up to 40%.
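The effect of imperfect transcriptions can be simulated by injecting controlled errors into clean ground truth before training, as sketched below. The error types here (random substitution or deletion at a target rate) are only an example of the kind of hand-crafted errors such a study might use.

```python
# Inject controlled transcription errors into otherwise correct ground
# truth, e.g. at a 20% character error rate.
import random
import string


def corrupt_transcription(text: str, error_rate: float, seed: int = 0) -> str:
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch != " " and rng.random() < error_rate:
            if rng.random() < 0.5:
                continue                                         # deletion
            out.append(rng.choice(string.ascii_lowercase))       # substitution
        else:
            out.append(ch)
    return "".join(out)


print(corrupt_transcription("training with imperfect transcriptions", 0.2))
```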
Conference Paper
Historical data sources, like medical records or biological collections, consist of unstructured heterogeneous content: handwritten text, different sizes and types of fonts, and text overlapped with lines, images, stamps, and sketches. The information these documents can provide is important, from a historical perspective and mainly because we can learn from it. The automatic digitization of these historical documents is a complex machine learning process that usually produces poor results, requiring costly interventions by experts, who have to transcribe and interpret the content. This paper describes hybrid (Human- and Machine-Intelligent) workflows for scientific data extraction, combining machine-learning and crowdsourcing software elements. Our results demonstrate that the mix of human and machine processes has advantages in data extraction time and quality, when compared to a machine-only workflow. More specifically, we show how OCRopus and Tesseract, two widely used open source Optical Character Recognition (OCR) tools, can improve their accuracy by more than 42%, when text areas are cropped by humans prior to OCR, while the total time can increase or decrease depending on the OCR selection. The digitization of 400 images, with Entomology, Bryophyte, and Lichen specimens, is evaluated following four different approaches: processing the whole specimen image (machine-only), processing crowd cropped labels (hybrid), processing crowd cropped fields (hybrid), and cleaning the machine-only output. As a secondary result, our experiments reveal differences in speed and quality between Tesseract and OCRopus.
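The crop-before-OCR comparison described above can be sketched with off-the-shelf tools: OCR the whole specimen image and, separately, a label region cropped by a human. In the cited work the crop boxes come from crowdsourcing; here they are just hard-coded example coordinates.

```python
# Compare OCR on the full specimen image against OCR on a human-provided
# label crop, using Tesseract via pytesseract.
from PIL import Image
import pytesseract


def ocr_full_vs_crop(image_path: str, crop_box: tuple) -> tuple:
    img = Image.open(image_path)
    full_text = pytesseract.image_to_string(img)
    label_text = pytesseract.image_to_string(img.crop(crop_box))  # (left, upper, right, lower)
    return full_text, label_text


# full, cropped = ocr_full_vs_crop("specimen_0001.png", (120, 800, 620, 1000))
```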
Article
Abstract: As part of the BMBF-funded KALLIMACHOS project at the University of Würzburg, the textual basis for digital editions is, among other things, to be obtained via OCR. The working corpus consists of German, French and Latin incunabula. This article shows how the problems of applying OCR to incunabula can be tackled with methods and programs that already exist today. To this end, a procedure was tested at the Würzburg University Library which already achieves character accuracies of up to 95 percent and word accuracies of up to 73 percent on selected works from a single printing workshop.
Article
This article describes the results of a case study to apply Optical Character Recognition (OCR) to scanned images of books printed between 1487 and 1870 by training the OCR engine OCRopus (Breuel et al. 2013) on the RIDGES herbal text corpus (Odebrecht et al., submitted). The resulting machine-readable text has character accuracies (percentage of correctly recognized characters) from 94% to more than 99% for even the earliest printed books, which were thought to be inaccessible by OCR methods until recently. Training specific OCR models was possible because the necessary "ground truth" has been available as error-corrected diplomatic transcriptions. The OCR results have been evaluated for accuracy against the ground truth of unseen test sets. Furthermore, mixed OCR models trained on a subset of books have been tested for their predictive power on page images of other books in the corpus, mostly yielding character accuracies well above 90%. It therefore seems possible to construct generalized models covering a range of fonts that can be applied to a wide variety of historical printings. A moderate postcorrection effort of some pages will then enable the training of individual models with even better accuracies. Using this method, diachronic corpora including early printings can be constructed much faster and cheaper than by manual transcription. The OCR methods reported here open up the possibility of transforming our printed textual cultural heritage into electronic text by largely automatic means, which is a prerequisite for the mass conversion of scanned books.
Thesis
Full-text available
The task of printed Optical Character Recognition (OCR) is considered a "solved" issue by many Pattern Recognition (PR) researchers. The notion, however, while partially true, does not represent the whole picture. Although it is true that state-of-the-art OCR systems exist for many scripts, for example, for Latin, Greek, Han (Chinese), and Kana (Japanese), there is still a need for exhaustive research for many other challenging modern scripts. Examples of such scripts are the cursive Nabataean scripts, which include Arabic, Persian, and Urdu, and the Brahmic family of scripts, which contains Devanagari, Sanskrit, and their derivatives. These scripts present many challenging issues for OCR, for example, the change in shape of a character within a word depending upon its location, kerning, and a huge number of ligatures. Moreover, OCR research for historical documents still requires much probing; therefore, efforts are required to develop robust OCR systems to preserve the literary heritage. Likewise, there is a need to address the issue of OCR of multilingual documents. Plenty of multilingual documents exist in the current time of globalization, which has increased the influence of different languages on each other. There is an increase in the usage of foreign words and phrases in articles, newspapers, and books, which are generating a large body of multilingual literature every day. Another effect is seen in the products we use in our daily lives. From packaging of imported food items to sophisticated electronics, the demand of international customers to access information about these products in their native language is ever increasing. The use of multilingual operational manuals, books, and dictionaries motivates the need for multilingual OCR systems for their digitization. The aim of this thesis is to find answers to some of these challenges using contemporary machine learning methodologies, especially Recurrent Neural Networks (RNN). Specifically, a recent architecture of these networks, referred to as Long Short-Term Memory (LSTM) networks, has been employed to OCR modern as well as historical documents. The excellent OCR results obtained on these documents encourage us to extend their application to the field of multilingual OCR. The LSTM networks are first evaluated on standard English datasets to benchmark their performance. They yield better recognition results than any other contemporary OCR techniques without using sophisticated features and language modeling. Therefore, their application is further extended to more complex scripts that include Urdu Nastaleeq and Devanagari. For Urdu Nastaleeq script, LSTM networks achieve the best reported OCR results (2.55% Character Error Rate (CER)) on a publicly available data set, while for Devanagari script, a new freely available database has been introduced on which a CER of 9% is achieved. The LSTM-based methodology is further extended to the OCR of historical documents. In this regard, this thesis focuses on the old German Fraktur script, the medieval Latin script of the 15th century, and the Polytonic Greek script. LSTM-based systems outperform the contemporary OCR systems on all of these scripts. For old documents, it is usually very hard to prepare a transcribed dataset for training a neural network in the supervised learning paradigm. A novel methodology has been proposed by combining segmentation-based and segmentation-free approaches to OCR scripts for which no transcribed training data is available.
For German Fraktur and Polytonic Greek scripts, artificially generated data from existing text corpora yield highly promising results (CER of <1% and 5% for Fraktur and Polytonic Greek scripts, respectively). Another major contribution of this thesis is an efficient Multilingual OCR (MOCR) system, which has been achieved in two ways. Firstly, a sequence-learning based script identification system has been developed that works at the text-line level, thereby eliminating the need to segment individual characters or words prior to actual script identification. And secondly, a unified approach for dealing with different types of multilingual documents has been proposed. The core motivation behind this generalized framework is the reading ability of the human mind to process multilingual documents, where no script identification takes place. In this design, the LSTM networks recognize multiple scripts simultaneously without the need to identify different scripts. The first step in building this framework is the realization of a language-independent OCR system that recognizes multilingual text in a single step, using a single Long Short-Term Memory (LSTM) model. The language-independent approach is then extended to a script-independent OCR framework that can recognize multiscript documents using a single OCR model. The proposed generalized approach yields a low error rate (1.2%) on a database of English-Greek bilingual documents. In summary, this thesis aims to extend OCR research from modern Latin scripts to old Latin, to Greek and to other "underprivileged" scripts such as Devanagari and Urdu Nastaleeq. It also provides a generalized OCR framework for dealing with multilingual documents.
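The script-independent idea can be pictured as building one label alphabet from the union of all training transcriptions (e.g. English plus Greek) and training a single sequence model over it, with no script identification step. Only the alphabet/codec construction is sketched below; the LSTM+CTC training itself is out of scope.

```python
# Build a single label alphabet (codec) over the union of all scripts'
# characters, so that one model can be trained on multiscript text lines.
def build_codec(transcriptions: list) -> dict:
    charset = sorted({ch for line in transcriptions for ch in line})
    return {ch: idx + 1 for idx, ch in enumerate(charset)}   # 0 reserved for the CTC blank


def encode(line: str, codec: dict) -> list:
    return [codec[ch] for ch in line]


codec = build_codec(["the quick brown fox", "καλημέρα κόσμε"])
print(len(codec), encode("the", codec))
```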
Conference Paper
Full-text available
The Tesseract engine supports multilingual text recognition. However, the recognition of cursive scripts using Tesseract is a challenging task. In this paper, the Tesseract engine is analyzed and modified for the recognition of the Nastalique writing style for the Urdu language, which is a very complex and cursive writing style of the Arabic script. The original Tesseract system has accuracies of 65.59% and 65.84% for font sizes 14 and 16, respectively, whereas the modified system, with a reduced search space, gives accuracies of 97.87% and 97.71%, respectively. The efficiency is also improved, from an average of 170 milliseconds (ms) to an average of 84 ms, for the recognition of Nastalique document images.
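The cited work reduces the search space by modifying Tesseract internally. A loosely related, much simpler external mechanism is Tesseract's character whitelist, shown below only as an illustration of constraining the hypothesis space; note that `tessedit_char_whitelist` is honoured by the legacy engine (`--oem 0`) rather than the LSTM engine in Tesseract 4.

```python
# Restrict Tesseract's output alphabet from the outside via a character
# whitelist (illustration only; not the internal modification from the paper).
from PIL import Image
import pytesseract

DIGITS_ONLY = "--oem 0 --psm 7 -c tessedit_char_whitelist=0123456789"

# text = pytesseract.image_to_string(Image.open("line.png"), config=DIGITS_ONLY)
```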
Conference Paper
Full-text available
Long Short-Term Memory (LSTM) networks are powerful sequence learning machines. Their context-aware processing makes them a suitable candidate for segmentation-free Optical Character Recognition (OCR) tasks, where recognition is done at the text-line level. These networks have shown promising results for cursive scripts where individual characters change their shapes based on their position in a word or a ligature. In this paper, we report the results of applying these networks to Devanagari script, where consonant-consonant conjuncts and consonant-vowel combinations take different forms based on their position in the word. We also introduce a new database, Deva-DB, of Devanagari script (free of cost) to aid research towards a robust Devanagari OCR system. On this database, LSTM-based OCRopus systems yield error rates ranging from 1.2% to 9.0% depending upon the complexity of the training and test data. A comparison with the open-source Tesseract system is also presented for the same database.
Conference Paper
Full-text available
Recurrent neural networks (RNN) have been successfully applied for the recognition of cursive handwritten documents, in both English and Arabic scripts. The ability of RNNs to model context in sequence data like speech and text makes them a suitable candidate for developing OCR systems for printed Nabataean scripts (including Nastaleeq, for which no OCR system is available to date). In this work, we present the results of applying RNNs to printed Urdu text in Nastaleeq script. A Bidirectional Long Short-Term Memory (BLSTM) architecture with a Connectionist Temporal Classification (CTC) output layer was employed to recognize the printed Urdu text. We evaluated BLSTM networks for two cases: one ignoring the characters' shape variations and the second considering them. The recognition error rate at character level for the first case is 5.15% and for the second is 13.6%. These results were obtained on the synthetically generated UPTI dataset containing artificially degraded images, which reflect some real-world scanning artefacts, along with clean images. A comparison with a shape-matching based method is also presented.
Conference Paper
Full-text available
Long Short-Term Memory (LSTM) networks have yielded excellent results on handwriting recognition. This paper describes an application of bidirectional LSTM networks to the problem of machine-printed Latin and Fraktur recognition. Latin and Fraktur recognition differs significantly from handwriting recognition in both the statistical properties of the data, as well as in the required, much higher levels of accuracy. Applications of LSTM networks to handwriting recognition use two-dimensional recurrent networks, since the exact position and baseline of handwritten characters is variable. In contrast, for printed OCR, we used a one-dimensional recurrent network combined with a novel algorithm for baseline and x-height normalization. A number of databases were used for training and testing, including the UW3 database, artificially generated and degraded Fraktur text and scanned pages from a book digitization project. The LSTM architecture achieved 0.6% character-level test-set error on English text. When the artificially degraded Fraktur data set is divided into training and test sets, the system achieves an error rate of 1.64%. On specific books printed in Fraktur (not part of the training set), the system achieves error rates of 0.15% (Fontane) and 1.47% (Ersch-Gruber). These recognition accuracies were found without using any language modelling or any other post-processing techniques.
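The baseline and x-height normalization mentioned above is what makes 1D, line-level recurrent OCR feasible. OCRopus implements a more elaborate filter-based normalizer; the sketch below only rescales a line image to a fixed target height as a rough stand-in for that step.

```python
# Rough stand-in for text-line normalization: rescale each line image to a
# fixed height and invert it so ink has high values.
import numpy as np
from PIL import Image


def normalize_line(img: Image.Image, target_height: int = 48) -> np.ndarray:
    gray = img.convert("L")
    scale = target_height / gray.height
    resized = gray.resize((max(1, int(gray.width * scale)), target_height))
    arr = np.asarray(resized, dtype=np.float32) / 255.0
    return 1.0 - arr      # white background -> 0, ink -> high values
```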
Conference Paper
Full-text available
Many real-world sequence learning tasks require the prediction of sequences of labels from noisy, unsegmented input data. In speech recognition, for example, an acoustic signal is transcribed into words or sub-word units. Recurrent neural networks (RNNs) are powerful sequence learners that would seem well suited to such tasks. However, because they require pre-segmented training data, and post-processing to transform their outputs into label sequences, their applicability has so far been limited. This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems. An experiment on the TIMIT speech corpus demonstrates its advantages over both a baseline HMM and a hybrid HMM-RNN.
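The CTC output layer introduced in this paper emits one distribution per input frame over the labels plus a special blank symbol. A minimal greedy (best-path) decoder takes the per-frame argmax, collapses repeated labels, and drops blanks, as sketched below; this is the standard decoding rule, not code from the paper.

```python
# Greedy (best-path) CTC decoding: argmax per frame, collapse repeats,
# drop blanks.
import numpy as np

BLANK = 0


def ctc_greedy_decode(frame_probs: np.ndarray) -> list:
    """frame_probs: (time, num_labels) array of per-frame label probabilities."""
    best_path = frame_probs.argmax(axis=1)
    decoded, prev = [], BLANK
    for label in best_path:
        if label != BLANK and label != prev:
            decoded.append(int(label))
        prev = label
    return decoded


probs = np.array([[0.1, 0.9, 0.0],   # label 1
                  [0.1, 0.9, 0.0],   # repeated 1 -> collapsed
                  [0.8, 0.1, 0.1],   # blank
                  [0.1, 0.0, 0.9]])  # label 2
print(ctc_greedy_decode(probs))       # [1, 2]
```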
Article
Full-text available
Basic backpropagation, which is a simple method now being widely used in areas like pattern recognition and fault diagnosis, is reviewed. The basic equations for backpropagation through time, and applications to areas like pattern recognition involving dynamic systems, systems identification, and control, are discussed. Further extensions of this method, to deal with systems other than neural networks, systems involving simultaneous equations, or true recurrent networks, and other practical issues arising with the method, are described. Pseudocode is provided to clarify the algorithms. The chain rule for ordered derivatives, the theorem which underlies backpropagation, is briefly discussed. The focus is on designing a simpler version of backpropagation which can be translated into computer code and applied directly by neural network users.
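Backpropagation through time can be illustrated on the smallest possible case: a linear recurrent unit h_t = w * h_{t-1} + x_t with loss L = 0.5 * (h_T - y)^2. The sketch below accumulates dL/dw by unrolling the recurrence and checks it against a finite-difference estimate; it is a toy illustration, not the pseudocode from the article.

```python
# Toy backpropagation-through-time example on a linear recurrent unit.
def forward(w, xs, h0=0.0):
    h = h0
    hs = [h0]
    for x in xs:
        h = w * h + x
        hs.append(h)
    return hs


def grad_bptt(w, xs, y):
    hs = forward(w, xs)
    dh_dw = 0.0
    for h_prev in hs[:-1]:               # unroll: dh_t/dw = h_{t-1} + w * dh_{t-1}/dw
        dh_dw = h_prev + w * dh_dw
    return (hs[-1] - y) * dh_dw          # dL/dw = (h_T - y) * dh_T/dw


w, xs, y = 0.8, [1.0, 0.5, -0.2], 1.0
eps = 1e-6
numeric = (0.5 * (forward(w + eps, xs)[-1] - y) ** 2
           - 0.5 * (forward(w - eps, xs)[-1] - y) ** 2) / (2 * eps)
print(grad_bptt(w, xs, y), numeric)       # the two values agree
```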
Article
In this paper, we present an Arabic handwriting recognition method based on recurrent neural networks. We use the Long Short-Term Memory (LSTM) architecture, which has proven successful in different printed and handwritten OCR tasks. Applications of LSTM to handwriting recognition employ the two-dimensional architecture to deal with variations along both the vertical and horizontal axes. However, we show that using a simple pre-processing step that normalizes the position and baseline of letters, we can make use of a 1D LSTM, which is faster in learning and convergence, and yet achieves superior performance. In a series of experiments on the IFN/ENIT database for Arabic handwriting recognition, we demonstrate that our proposed pipeline can outperform 2D LSTM networks. Furthermore, we provide comparisons with 1D LSTM networks trained with manually crafted features to show that the automatically learned features in a globally trained 1D LSTM network with our normalization step can even outperform such systems.
Conference Paper
This paper reports on high-performance Optical Character Recognition (OCR) experiments using Long Short-Term Memory (LSTM) networks for Greek polytonic script. Even though there are many Greek polytonic manuscripts, the digitization of such documents has not been widely applied, and very limited work has been done on the recognition of such scripts. We have collected a diverse number of document pages of Greek polytonic scripts in a novel database, called Polyton-DB, containing 15,689 text lines of synthetic and authentic printed scripts, and performed baseline experiments using LSTM networks. Evaluation results show that the character error rate obtained with LSTM varies from 5.51% to 14.68% and that LSTM outperforms two well-known OCR engines, namely Tesseract and ABBYY FineReader.
Conference Paper
The Tesseract OCR engine, as was the HP Research Prototype in the UNLV Fourth Annual Test of OCR Accuracy, is described in a comprehensive overview. Emphasis is placed on aspects that are novel or at least unusual in an OCR engine, including in particular the line finding, features/classification methods, and the adaptive classifier.
Training Tesseract for Ancient Greek OCR
N. White, "Training Tesseract for Ancient Greek OCR," Eutypon, pp. 1-11, 2013.