Conference Paper

OCRoRACT: A Sequence Learning OCR System Trained on Isolated Characters

April 2016

DOI:10.1109/DAS.2016.51

Conference: DAS 2016, 12th Int’l IAPR Workshop on Document Analysis Systems
At: Santorini, Greece

Authors:

Andreas Dengel

Deutsches Forschungszentrum für Künstliche Intelligenz

Adnan Ul-Hasan

RPTU - Rheinland-Pfälzische Technische Universität Kaiserslautern Landau

Syed Saqib Bukhari

Document image analysis of Balinese palm leaf manuscripts

Thesis

Jul 2018

Made Windu Antara Kesiman

Palm leaf manuscripts are an important part of the culture and life of Southeast Asian peoples. With the growing number of digitization projects for heritage documents around the world, the palm leaf manuscript collections of Southeast Asia have attracted the attention of researchers in document image analysis (DIA). The research conducted for this dissertation focused on the palm leaf manuscript collections of Indonesia, especially those from Bali, and contributes to DIA research on such collections. These collections pose new challenges for DIA research because they use palm leaf as the writing medium and a language and script that have never been analyzed before. Motivated by the context and real conditions of the palm leaf manuscript collections in Bali, this research sought to add value to digitized palm leaf manuscripts by developing tools to analyze, transliterate, and index their content. These systems aim to make palm leaf manuscripts more accessible, readable, and understandable to a wider audience and to scholars and students all over the world. This research developed a DIA system for document images of palm leaf manuscripts that includes several image processing tasks, beginning with digitization of the document, ground truth construction, binarization, and text line and glyph segmentation, and ending with glyph and word recognition, transliteration, and document indexing and retrieval. In this research, we created the first corpus and dataset of Balinese palm leaf manuscripts for the DIA research community. We also developed a glyph recognition system and an automatic transliteration system for Balinese palm leaf manuscripts. This dissertation proposed a complete scheme of spatially categorized glyph recognition for the transliteration of Balinese palm leaf manuscripts.
The proposed scheme consists of six tasks: text line and glyph segmentation, glyph ordering, detection of the spatial position for glyph category, global and categorized glyph recognition, option selection for glyph recognition, and transliteration with a phonological rule-based machine. An implementation of knowledge representation and phonological rules for the automatic transliteration of Balinese script on palm leaf manuscripts is proposed. The adaptation of a segmentation-free LSTM-based transliteration system, using a generated synthetic dataset and training schemes at two different levels (word level and text line level), is also proposed.

Comparative analysis of raw images and meta feature based Urdu OCR using CNN and LSTM

Article

Jan 2018

Abstract: The Urdu language uses a cursive script, which results in connected characters constituting ligatures. To identify characters within ligatures of different scales (font sizes), a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network are used. Both network models are trained on previously extracted ligature thickness graphs, from which the models extract meta features. These thickness graphs provide consistent information across different font sizes. The LSTM and CNN are also trained on raw images to compare performance on both forms of input. For this research, two corpora are used: Urdu Printed Text Images (UPTI) and Centre for Language Engineering (CLE) Text Images. Overall performance of the networks ranges between 90% and 99.8%. Average accuracy is 98.08% on meta features and 97.07% on raw images.

Segmentation-free Optical Character Recognition for Printed Urdu Text

Article

Sep 2017
Int J Image Video Process

Israr Uddin Khattak

This paper presents a segmentation-free optical character recognition system for printed Urdu Nastaliq font using ligatures as units of recognition. The proposed technique relies on statistical features and employs Hidden Markov Models for classification. A total of 1525 unique high-frequency Urdu ligatures from the standard Urdu Printed Text Images (UPTI) database are considered in our study. Ligatures extracted from text lines are first split into primary (main body) and secondary (dots and diacritics) ligatures, and multiple instances of the same ligature are grouped into clusters using a sequential clustering algorithm. Hidden Markov Models are trained separately for each ligature using the examples in the respective cluster, by sliding overlapping windows from right to left and extracting a set of statistical features. Given the query text, the primary and secondary ligatures are recognized separately and later associated using a set of heuristics to recognize the complete ligature. Evaluated on the standard UPTI Urdu database, the system achieved a ligature recognition rate of 92% on more than 6000 query ligatures.

Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 31e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 1 : Journées d'Études sur la Parole

Book

Jan 2020

Transkribus. A Platform for Automated Text Recognition and Searching of Historical Documents

Conference Paper

Sep 2019

Multi-font Devanagari Text Recognition Using LSTM Neural Networks

Chapter

Jan 2020

Teja Kundaikar

Jyoti D. Pawar

Current research in OCR is focusing on the effect of multi-font and multi-size text on accuracy. To the best of our knowledge, no study has been carried out on the effect of multi-font and multi-size text on the accuracy of Devanagari OCRs. The most popular Devanagari OCRs in the market today are Tesseract OCR, Indsenz OCR and eAksharayan OCR. In this research work, we have studied the effect of five font styles, namely Nakula, Baloo, Dekko, Biryani and Aparajita, on these three OCRs. It has been observed that the accuracy of the Devanagari OCRs depends on the font style in the text document images. Hence, we have proposed a multi-font Devanagari OCR (MFD_OCR), a text line recognition model using long short-term memory (LSTM) neural networks. We have created the training dataset Multi_Font_Train, which consists of text document images and their corresponding text files. It contains each text line in five different font styles, namely Nakula, Baloo, Dekko, Biryani and Aparajita. The test datasets are created using the text from the benchmark dataset [1] for each of the font styles mentioned above, and they are named BMT_Nakula, BMT_Baloo, BMT_Dekko, BMT_Biryani and BMT_Aparajita. On the evaluation of all OCRs, the MFD_OCR showed consistent accuracy across all these test datasets. It obtained comparatively good accuracy for the BMT_Dekko and BMT_Biryani test datasets. On performing detailed error analysis, we noticed that, compared to other Devanagari OCRs, the MFD_OCR has consistent insertion and deletion errors across the test datasets for each font style. The deletion errors are negligible, ranging from 0.8 to 1.4%.

Transforming Scholarship in the Archives Through Handwritten Text Recognition: Transkribus as a Case Study

Article

Jul 2019
J DOC

Purpose: An overview of the current use of handwritten text recognition (HTR) on archival manuscript material, as provided by the EU H2020 funded Transkribus platform. It explains HTR, demonstrates Transkribus, gives examples of use cases, highlights the effect HTR may have on scholarship, and evidences this turning point of the advanced use of digitised heritage content. The paper aims to discuss these issues.
Design/methodology/approach: This paper adopts a case study approach, using the development and delivery of the one openly available HTR platform for manuscript material.
Findings: Transkribus has demonstrated that HTR is now a useable technology that can be employed in conjunction with mass digitisation to generate accurate transcripts of archival material. Use cases are demonstrated, and a cooperative model is suggested as a way to ensure sustainability and scaling of the platform. However, funding and resourcing issues are identified.
Research limitations/implications: The paper presents results from projects; further user studies could be undertaken involving interviews, surveys, etc.
Practical implications: Only HTR provided via Transkribus is covered; however, this is the only publicly available platform for HTR on individual collections of historical documents at the time of writing, and it represents the current state of the art in this field.
Social implications: The increased access to information contained within historical texts has the potential to be transformational for both institutions and individuals.
Originality/value: This is the first published overview of how HTR is used by a wide archival studies community, reporting and showcasing current application of handwriting technology in the cultural heritage sector.

Segmentation-Free Speech Text Recognition for Comic Books

Conference Paper

Nov 2017

Automatic quality evaluation and (semi-) automatic improvement of mixed models for OCR on historical documents

Article

Jun 2016

Good OCR results on historical documents rely on diplomatic transcriptions of printed material as ground truth, which is both a scarce resource and time-consuming to generate. A strategy is proposed which starts from a mixed model trained on already available transcriptions from different centuries, giving accuracies over 90% on a test set from the same period of time and overcoming the typography barrier of having to train individual models separately for each historical typeface. It is shown that both mean character confidence (as output by the OCR engine OCRopus) and lexicality (a measure of correctness of OCR tokens compared to a lexicon of modern wordforms taking historical spelling patterns into account, which can be calculated for any OCR engine) correlate with true accuracy determined from a comparison of OCR results with ground truth. These measures are then used to guide the training of new individual OCR models, either using OCR prediction as pseudo ground truth (fully automatic method) or choosing a minimum set of hand-corrected lines as training material (manual method). As few as 40-80 hand-corrected lines lead to OCR results with character error rates of only a few percent. This procedure minimizes the amount of ground truth production and does not depend on the previous construction of a specific typographic model.
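As a rough illustration of the lexicality measure described in this abstract, the following Python sketch computes the fraction of OCR tokens found in a lexicon. It is a simplified assumption, not the authors' implementation: the function name `lexicality`, the tokenization rule, and the plain set lookup are hypothetical, and the handling of historical spelling patterns mentioned in the abstract is omitted.

```python
import re


def lexicality(ocr_text, lexicon):
    """Return the fraction of word tokens in ocr_text that occur in lexicon.

    Hypothetical simplification: tokens are runs of letters, compared
    case-insensitively against a set of lowercase wordforms. Historical
    spelling patterns are NOT taken into account here.
    """
    # [^\W\d_]+ matches runs of Unicode letters (word characters minus digits and "_").
    tokens = re.findall(r"[^\W\d_]+", ocr_text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in lexicon)
    return known / len(tokens)


if __name__ == "__main__":
    sample_lexicon = {"the", "cat", "sat", "on", "mat"}
    # "tbe" is an OCR error, so five of the six tokens are in the lexicon.
    print(lexicality("The cat sat on tbe mat", sample_lexicon))
```

A score like this needs no ground truth, which is why a measure of this kind can serve as a proxy for accuracy when choosing which lines to correct by hand.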

Digitizing History: Transitioning Historical Paper Documents to Digital Content for Information Retrieval and Mining—A Comprehensive Survey

Article

Jan 2024

Historical document processing (HDP) corresponds to the task of converting the physically bound form of historical archives into a web-based, centrally digitized form for their conservation, preservation, and ubiquitous access. Besides the conservation of these invaluable historical collections, the key agenda is to make these geographically distributed historical repositories available for information mining and retrieval in a web-centralized touchless mode. Being a matter of interest for interdisciplinary scholars, the endeavor has garnered the attention of many researchers, resulting in an immense body of literature dedicated to digitization strategies. The present study first assembles the prevalent tasks essential for HDP into a pipeline and frames an outline for a generic workflow for historical document digitization. Then, it reports the latest task-specific state of the art, giving a brief discourse on the methods and open challenges in handling historical printed and handwritten script images. Next, grounded on various layout attributes, it discusses the evaluation metrics and datasets available for observational and analytical purposes. The current study is an attempt to trace the contours of ongoing research and its bottlenecks, thus providing readers with a comprehensive view and understanding of existing studies and unfolding the open avenues for the future outlook.

Historical Document Processing: A Survey of Techniques, Tools, and Trends

Preprint

Feb 2020

Historical Document Processing is the process of digitizing written material from the past for future use by historians and other scholars. It incorporates algorithms and software tools from various subfields of computer science, including computer vision, document analysis and recognition, natural language processing, and machine learning, to convert images of ancient manuscripts, letters, diaries, and early printed texts automatically into a digital format usable in data mining and information retrieval systems. Within the past twenty years, as libraries, museums, and other cultural heritage institutions have scanned an increasing volume of their historical document archives, the need to transcribe the full text from these collections has become acute. Since Historical Document Processing encompasses multiple sub-domains of computer science, knowledge relevant to its purpose is scattered across numerous journals and conference proceedings. This paper surveys the major phases of, standard algorithms, tools, and datasets in the field of Historical Document Processing, discusses the results of a literature review, and finally suggests directions for further research.

An Investigative Analysis of Different LSTM Libraries for Supervised and Unsupervised Architectures of OCR Training

Conference Paper

Aug 2018

Transcription Free LSTM OCR Model Evaluation

Conference Paper

Aug 2018

Meta features-based scale invariant OCR decision making using LSTM-RNN

Article

Jun 2019
Comput Math Organ Theor

Urdu optical character recognition (OCR) is a complex problem due to the nature of its script, which is cursive. Recognizing characters of different font sizes further complicates the problem. In this research, long short term memory-recurrent neural network (LSTM-RNN) and convolution neural network (CNN) are used to recognize Urdu optical characters of different font sizes. LSTM-RNN is trained on formerly extracted feature sets, which are extracted for scale invariant recognition of Urdu characters. From these features, LSTM-RNN extracts meta features. CNN is trained on raw binary images. Two benchmark datasets, i.e. centre for language engineering text images (CLETI) and Urdu printed text images (UPTI) are used. LSTM-RNN reveals consistent results on both datasets, and outperforms CNN. Maximum 99% accuracy is achieved using LSTM-RNN.

Training LSTM-RNN with Imperfect Transcription: Limitations and Outcomes

Conference Paper

Nov 2017

Bidirectional LSTM-RNN have become one of the standard methods for sequence learning, especially in the context of OCR due to its ability to process unsegmented data and its inherent statistical language modeling [5]. It has recently been shown that training LSTM-RNNs even with imperfect transcriptions can lead to improved transcription results [7, 14]. The statistical nature of the LSTM's inherent language modeling can compensate for some of the errors in the ground truth and learn the correct temporal relations. In this paper we systematically explore the limits of the LSTM's language modeling ability by comparing the impact of imperfect transcriptions with various hand crafted error types and real erroneous data created through segmentation and clustering. We show that training LSTM-RNN with imperfect transcriptions can produce useful OCR models even if the ground truth error is up to 20%. Further we show that it can compensate for some handcrafted error types with error rates of up to 40% almost perfectly.

Cooperative human-machine data extraction from biological collections

Conference Paper

Oct 2016

Historical data sources, like medical records or biological collections, consist of unstructured heterogeneous content: handwritten text, different sizes and types of fonts, and text overlapped with lines, images, stamps, and sketches. The information these documents can provide is important, from a historical perspective and mainly because we can learn from it. The automatic digitization of these historical documents is a complex machine learning process that usually produces poor results, requiring costly interventions by experts, who have to transcribe and interpret the content. This paper describes hybrid (Human- and Machine-Intelligent) workflows for scientific data extraction, combining machine-learning and crowdsourcing software elements. Our results demonstrate that the mix of human and machine processes has advantages in data extraction time and quality, when compared to a machine-only workflow. More specifically, we show how OCRopus and Tesseract, two widely used open source Optical Character Recognition (OCR) tools, can improve their accuracy by more than 42%, when text areas are cropped by humans prior to OCR, while the total time can increase or decrease depending on the OCR selection. The digitization of 400 images, with Entomology, Bryophyte, and Lichen specimens, is evaluated following four different approaches: processing the whole specimen image (machine-only), processing crowd cropped labels (hybrid), processing crowd cropped fields (hybrid), and cleaning the machine-only output. As a secondary result, our experiments reveal differences in speed and quality between Tesseract and OCRopus.

OCR bei Inkunabeln – Offizinspezifischer Ansatz der Universitätsbibliothek Würzburg

Article

Sep 2016

Summary: As part of the BMBF-funded project KALLIMACHOS at the University of Würzburg, one goal is to obtain the textual basis for digital editions by means of OCR. The working corpus consists of German, French, and Latin incunabula. This article shows how the problems of OCR on incunabula can be addressed with methods and programs that already exist today. To this end, a procedure was tested at the Würzburg University Library that already achieves character accuracies of up to 95 percent and word accuracies of up to 73 percent on selected works from a single printing workshop.

OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus

Article

Aug 2016

This article describes the results of a case study to apply Optical Character Recognition (OCR) to scanned images of books printed between 1487 and 1870 by training the OCR engine OCRopus (Breuel et al. 2013) on the RIDGES herbal text corpus (Odebrecht et al., submitted). The resulting machine-readable text has character accuracies (percentage of correctly recognized characters) from 94% to more than 99% for even the earliest printed books, which were thought to be inaccessible by OCR methods until recently. Training specific OCR models was possible because the necessary "ground truth" has been available as error-corrected diplomatic transcriptions. The OCR results have been evaluated for accuracy against the ground truth of unseen test sets. Furthermore, mixed OCR models trained on a subset of books have been tested for their predictive power on page images of other books in the corpus, mostly yielding character accuracies well above 90%. It therefore seems possible to construct generalized models covering a range of fonts that can be applied to a wide variety of historical printings. A moderate postcorrection effort of some pages will then enable the training of individual models with even better accuracies. Using this method, diachronic corpora including early printings can be constructed much faster and cheaper than by manual transcription. The OCR methods reported here open up the possibility of transforming our printed textual cultural heritage into electronic text by largely automatic means, which is a prerequisite for the mass conversion of scanned books.
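To make the accuracy figures in this abstract concrete, the sketch below computes a character error rate and the corresponding character accuracy from an OCR output and its ground truth. It assumes the common convention of dividing the Levenshtein edit distance by the ground-truth length; the abstract itself only defines accuracy as the percentage of correctly recognized characters, so the exact evaluation procedure and the function names here are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance divided by the number of ground-truth characters."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)


def character_accuracy(ocr_text, ground_truth):
    """Character accuracy as 1 - CER (can be negative for very poor output)."""
    return 1.0 - character_error_rate(ocr_text, ground_truth)


if __name__ == "__main__":
    # One substituted character in a seven-character ground truth.
    print(character_accuracy("Krauter", "Kräuter"))
```

Under this convention, a character accuracy of 94% corresponds to roughly six edit operations per hundred ground-truth characters.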

Generic Text Recognition using Long Short-Term Memory Networks

Thesis

Jan 2016

Adnan Ul-Hasan

The task of printed Optical Character Recognition (OCR) is considered a “solved” issue by many Pattern Recognition (PR) researchers. This notion, although partially true, does not represent the whole picture. Although state-of-the-art OCR systems exist for many scripts, for example Latin, Greek, Han (Chinese), and Kana (Japanese), there is still a need for exhaustive research on many other challenging modern scripts. Examples of such scripts are the cursive Nabataean scripts, which include Arabic, Persian, and Urdu, and the Brahmic family of scripts, which contains Devanagari, Sanskrit, and its derivatives. These scripts present many challenging issues for OCR, for example, changes in the shape of a character within a word depending on its location, kerning, and a huge number of ligatures. Moreover, OCR research for historical documents still requires much probing; therefore, efforts are required to develop robust OCR systems to preserve the literary heritage.
Likewise, there is a need to address the issue of OCR of multilingual documents. Plenty of multilingual documents exist in the current time of globalization, which has increased the influence of different languages on each other. There is an increase in the usage of foreign words and phrases in articles, newspapers, and books, which are generating a large body of multilingual literature every day. Another effect is seen in the products we use in our daily lives. From packaging of imported food items to sophisticated electronics, the demand of international customers to access information about these products in their native language is ever increasing. The use of multilingual operational manuals, books, and dictionaries motivates the need for multilingual OCR systems for their digitization.
The aim of this thesis is to find answers to some of these challenges using contemporary machine learning methodologies, especially Recurrent Neural Networks (RNN). Specifically, a recent architecture of these networks, referred to as Long Short-Term Memory (LSTM) networks, has been employed to OCR modern as well as historical documents. The excellent OCR results obtained on these documents encourage us to extend their application to the field of multilingual OCR.
The LSTM networks are first evaluated on standard English datasets to benchmark their performance. They yield better recognition results than any other contemporary OCR technique without using sophisticated features and language modeling. Therefore, their application is further extended to more complex scripts, including Urdu Nastaleeq and Devanagari. For Urdu Nastaleeq script, LSTM networks achieve the best reported OCR results (2.55% Character Error Rate (CER)) on a publicly available data set, while for Devanagari script, a new freely available database has been introduced, on which a CER of 9% is achieved.
The LSTM-based methodology is further extended to the OCR of historical documents. In this regard, this thesis focuses on Old German Fraktur script, medieval Latin script of the 15th century, and the Polytonic Greek script. LSTM-based systems outperform the contemporary OCR systems on all of these scripts. For old documents, it is usually very hard to prepare a transcribed dataset for training a neural network in the supervised learning paradigm. A novel methodology has been proposed that combines segmentation-based and segmentation-free approaches to OCR scripts for which no transcribed training data is available. For German Fraktur and Polytonic Greek scripts, artificially generated data from existing text corpora yield highly promising results (CER of <1% and 5% for Fraktur and Polytonic Greek scripts respectively).
Another major contribution of this thesis is an efficient Multilingual OCR (MOCR) system, which has been achieved in two ways. Firstly, a sequence-learning based script identification system has been developed that works at the text-line level, thereby eliminating the need to segment individual characters or words prior to actual script identification. Secondly, a unified approach for dealing with different types of multilingual documents has been proposed. The core motivation behind this generalized framework is the ability of the human mind to read multilingual documents without any explicit script identification. In this design, the LSTM networks recognize multiple scripts simultaneously without the need to identify different scripts. The first step in building this framework is the realization of a language independent OCR system that recognizes multilingual text in a single step, using a single Long Short-Term Memory (LSTM) model. The language independent approach is then extended to a script independent OCR framework that can recognize multiscript documents using a single OCR model. The proposed generalized approach yields a low error rate (1.2%) on a database of English-Greek bilingual documents.
In summary, this thesis aims to extend OCR research from modern Latin scripts to old Latin, to Greek, and to other “underprivileged” scripts such as Devanagari and Urdu Nastaleeq. It also provides a generalized OCR framework for dealing with multilingual documents.

Adapting Tesseract for Complex Scripts: An Example for Urdu Nastalique

Conference Paper

Apr 2014

Tesseract engine supports multilingual text recognition. However, the recognition of cursive scripts using Tesseract is a challenging task. In this paper, Tesseract engine is analyzed and modified for the recognition of Nastalique writing style for Urdu language which is a very complex and cursive writing style of Arabic script. Original Tesseract system has 65.59% and 65.84% accuracies for 14 and 16 font sizes respectively, whereas the modified system, with reduced search space, gives 97.87% and 97.71% accuracies respectively. The efficiency is also improved from an average of 170 milliseconds (ms) to an average of 84 ms for the recognition of Nastalique document images.

A Segmentation-Free Approach for Printed Devanagari Script Recognition

Conference Paper

Aug 2015

Long Short-Term Memory (LSTM) networks are powerful sequence learning machines. Their context-aware processing makes them a suitable candidate for segmentation-free Optical Character Recognition (OCR) tasks, where recognition is done at the text-line level. These networks have shown promising results for cursive scripts where individual characters change their shapes based on their position in a word or a ligature. In this paper, we report the results of applying these networks to Devanagari script, where consonant-consonant conjuncts and consonant-vowel combinations take different forms based on their position in the word. We also introduce a new database, Deva-DB, of Devanagari script (free of cost) to aid research towards a robust Devanagari OCR system. On this database, LSTM-based OCRopus systems yield error rates ranging from 1.2% to 9.0%, depending upon the complexity of the training and test data. A comparison with the open-source Tesseract system is also presented for the same database.

Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks

Conference Paper

Aug 2013

Recurrent neural networks (RNN) have been successfully applied for recognition of cursive handwritten documents, both in English and Arabic scripts. The ability of RNNs to model context in sequence data like speech and text makes them a suitable candidate to develop OCR systems for printed Nabataean scripts (including Nastaleeq, for which no OCR system is available to date). In this work, we have presented the results of applying RNN to printed Urdu text in Nastaleeq script. A Bidirectional Long Short Term Memory (BLSTM) architecture with a Connectionist Temporal Classification (CTC) output layer was employed to recognize printed Urdu text. We evaluated BLSTM networks for two cases: one ignoring the character's shape variations and the other considering them. The recognition error rate at character level is 5.15% for the first case and 13.6% for the second. These results were obtained on the synthetically generated UPTI dataset, containing artificially degraded images to reflect some real-world scanning artefacts along with clean images. A comparison with a shape-matching based method is also presented.

High-Performance OCR for Printed English and Fraktur using LSTM Networks

Conference Paper

Aug 2013

Long Short-Term Memory (LSTM) networks have yielded excellent results on handwriting recognition. This paper describes an application of bidirectional LSTM networks to the problem of machine-printed Latin and Fraktur recognition. Latin and Fraktur recognition differs significantly from handwriting recognition in both the statistical properties of the data, as well as in the required, much higher levels of accuracy. Applications of LSTM networks to handwriting recognition use two-dimensional recurrent networks, since the exact position and baseline of handwritten characters is variable. In contrast, for printed OCR, we used a one-dimensional recurrent network combined with a novel algorithm for baseline and x-height normalization. A number of databases were used for training and testing, including the UW3 database, artificially generated and degraded Fraktur text and scanned pages from a book digitization project. The LSTM architecture achieved 0.6% character-level test-set error on English text. When the artificially degraded Fraktur data set is divided into training and test sets, the system achieves an error rate of 1.64%. On specific books printed in Fraktur (not part of the training set), the system achieves error rates of 0.15% (Fontane) and 1.47% (Ersch-Gruber). These recognition accuracies were found without using any language modelling or any other post-processing techniques.

Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks

Conference Paper

Jan 2006

Many real-world sequence learning tasks require the prediction of sequences of labels from noisy, unsegmented input data. In speech recognition, for example, an acoustic signal is transcribed into words or sub-word units. Recurrent neural networks (RNNs) are powerful sequence learners that would seem well suited to such tasks. However, because they require pre-segmented training data, and post-processing to transform their outputs into label sequences, their applicability has so far been limited. This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems. An experiment on the TIMIT speech corpus demonstrates its advantages over both a baseline HMM and a hybrid HMM-RNN.
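The step of turning per-frame network outputs into a label sequence can be illustrated with best-path (greedy) CTC decoding: merge consecutive repeated labels, then remove the blank symbol. The sketch below is a minimal illustration under assumptions, not code from the paper; the function name `ctc_best_path_decode` and the choice of `0` as the default blank index are hypothetical, and the input is assumed to be the most probable label at each time step.

```python
from itertools import groupby


def ctc_best_path_decode(frame_labels, blank=0):
    """Collapse a per-frame label sequence into an output label sequence.

    frame_labels: the most probable label at each time step (assumed given).
    blank: the symbol standing for "no label" (hypothetical default of 0).
    """
    # Step 1: merge runs of identical consecutive labels.
    collapsed = [label for label, _ in groupby(frame_labels)]
    # Step 2: drop blanks. A blank between two equal labels keeps them distinct.
    return [label for label in collapsed if label != blank]


if __name__ == "__main__":
    frames = list("-hh-e-ll-l-oo-")
    print("".join(ctc_best_path_decode(frames, blank="-")))
```

Because blanks are removed only after repeats are merged, a doubled letter survives decoding only when a blank separates its two occurrences in the frame sequence.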

Backpropagation through time: what it does and how to do it

Article

Nov 1990

Paul Werbos

Basic backpropagation, which is a simple method now being widely used in areas like pattern recognition and fault diagnosis, is reviewed. The basic equations for backpropagation through time, and applications to areas like pattern recognition involving dynamic systems, systems identification, and control are discussed. Further extensions of this method, to deal with systems other than neural networks, systems involving simultaneous equations, or true recurrent networks, and other practical issues arising with the method are described. Pseudocode is provided to clarify the algorithms. The chain rule for ordered derivatives, the theorem which underlies backpropagation, is briefly discussed. The focus is on designing a simpler version of backpropagation which can be translated into computer code and applied directly by neural network users.

A comparison of 1D and 2D LSTM architectures for the recognition of handwritten Arabic

Article

Feb 2015
Proceedings of SPIE

In this paper, we present an Arabic handwriting recognition method based on recurrent neural networks. We use the Long Short Term Memory (LSTM) architecture, which has proven successful in different printed and handwritten OCR tasks. Applications of LSTM for handwriting recognition employ the two-dimensional architecture to deal with variations along both the vertical and horizontal axes. However, we show that using a simple pre-processing step that normalizes the position and baseline of letters, we can make use of 1D LSTM, which is faster in learning and convergence, and yet achieve superior performance. In a series of experiments on the IFN/ENIT database for Arabic handwriting recognition, we demonstrate that our proposed pipeline can outperform 2D LSTM networks. Furthermore, we provide comparisons with 1D LSTM networks trained with manually crafted features to show that the automatically learned features in a globally trained 1D LSTM network with our normalization step can even outperform such systems.

Recognition of Historical Greek Polytonic Scripts Using LSTM Networks

Conference Paper

Aug 2015

This paper reports on high-performance Optical Character Recognition (OCR) experiments using Long Short-Term Memory (LSTM) Networks for Greek polytonic script. Even though there are many Greek polytonic manuscripts, the digitization of such documents has not been widely applied, and very limited work has been done on the recognition of such scripts. We have collected a diverse number of document pages of Greek polytonic scripts in a novel database, called Polyton-DB, containing 15,689 textlines of synthetic and authentic printed scripts and performed baseline experiments using LSTM Networks. Evaluation results showed that the character error rate obtained with LSTM varies from 5.51% to 14.68% and outperforms two well-known OCR engines, namely, Tesseract and ABBYY FineReader.

An Overview of the Tesseract OCR Engine

Conference Paper

Oct 2007

R. Smith

The Tesseract OCR engine, as was the HP Research Prototype in the UNLV Fourth Annual Test of OCR Accuracy, is described in a comprehensive overview. Emphasis is placed on aspects that are novel or at least unusual in an OCR engine, including in particular the line finding, features/classification methods, and the adaptive classifier.

Training Tesseract for Ancient Greek OCR

Jan 2013
1-11

N White

N. White, "Training Tesseract for Ancient Greek OCR," Eutypon, pp. 1-11, 2013.
