Fisheye lens example (from top to bottom): original image of 500 × 250 pixels centered at the pixel to be classified; the same image with a fisheye lens distortion; the image downsampled to 50 × 30 values used by the neural network classifier.

Source publication
Article
This paper proposes the use of hybrid Hidden Markov Model (HMM)/Artificial Neural Network (ANN) models for recognizing unconstrained offline handwritten texts. The structural part of the optical models has been modeled with Markov chains, and a Multilayer Perceptron is used to estimate the emission probabilities. This paper also presents new techni...

Context in source publication

Context 1
... experiments reported in this paper are conducted on handwritten text lines from the IAM database [32]. Version 3.0 of this database includes over 1,500 scanned forms of handwritten text from more than 650 different writers, for a total of more than 13,000 fully transcribed handwritten lines, without restrictions on the writing style or the writing instrument used. The sentences have been extracted from the Lancaster-Oslo/Bergen (LOB) text corpus [47]. A writer-independent text line recognition task has been considered. The subset of the IAM database used in this work consists of 6,161 training lines (from 283 writers), 920 validation lines (56 writers), and 2,781 test lines (161 writers). All of these data sets are disjoint, and no writer has contributed to more than one set. These partitions are the same as those used in several works by Bunke et al. [33], [48], [49]. A total of 87,967 instances of 11,320 distinct words occur in the union of the training, validation, and test sets. The lexicon is modeled with 78 characters: 26 lowercase letters, 26 uppercase letters, 10 digits, 14 punctuation marks, the space, and a character for garbage symbols.

As described in Section 2.1, an MLP has been used for image cleaning by learning the appropriate filter from examples. Original noisy images from the IAM database and the same images cleaned by hand formed the training pairs. Additionally, artificially noised images (created by following the ideas presented in [34]) were also used as training data. In this case, the MLP is used for regression: the input is a fixed-size moving window of 11 × 11 pixels centered at the pixel to be cleaned, and the output is the restored value of the current pixel (see Fig. 1). The Enhancer-MLP has two hidden layers of 32 and 16 sigmoid units and one linear output unit; a sketch of this topology is given after this passage. Training was performed using the stochastic version of the backpropagation algorithm with a momentum term [46] and the mean-square error (MSE) function. The last column of Table 1 shows the topology of the MLPs used for preprocessing.

As pointed out in Sections 2.2 and 2.4, the slope removal and image size normalization processes require two MLPs that classify local extrema as belonging to one of the reference lines (lower line, upper line, line of descenders, or line of ascenders). Supervised training patterns were needed to train these MLPs, so a subset of 1,000 images from the IAM training set has been used. Local extrema of the 1,000 images were semi-automatically labeled using an active learning approach: first, a horizontal projection algorithm was used to classify the points belonging to each reference line of a subset of the 1,000 images; second, this subset was manually corrected using a graphical tool designed for this purpose [12]; third, these images were used to train an MLP to classify interest points. With this “pretrained” MLP, interest points of the 1,000 images were automatically classified and afterward manually supervised. At the end of this process, we had a training set composed of the interest points of the 1,000 images: 800 lines were used as training data and the remaining 200 lines as validation data. The Slope-MLP was trained to classify local extrema as belonging or not belonging to the lower baseline.
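As a concrete illustration of the Enhancer-MLP topology just described, here is a minimal sketch, assuming PyTorch. The layer sizes (a flattened 11 × 11 input window, 32 and 16 sigmoid hidden units, one linear output) and the MSE/momentum training setup come from the text; the class name and the learning-rate and momentum values are illustrative assumptions.

```python
# Hedged sketch of the Enhancer-MLP described above (assuming PyTorch).
# Layer sizes and the MSE/momentum setup follow the text; lr and momentum
# values are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class EnhancerMLP(nn.Module):
    def __init__(self, window: int = 11):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(window * window, 32),  # flattened 11x11 moving window
            nn.Sigmoid(),
            nn.Linear(32, 16),
            nn.Sigmoid(),
            nn.Linear(16, 1),                # linear output: restored pixel value
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = EnhancerMLP()
criterion = nn.MSELoss()  # mean-square error, as in the text
# "Stochastic backpropagation with a momentum term" maps to SGD with momentum.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```

A training loop would then present (noisy window, clean pixel) pairs, matching the regression setup described above.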
The Slope-MLP input is a moving window around the current pixel, the choice of an appropriate window size being a trade-off between context and input size. To partially overcome this problem, we have opted to use a fisheye distortion centered at the pixel to be classified (see Fig. 6 for an example) [12]; a sketch of such a resampling is given after this passage. The fisheye distortion maintains a very accurate resolution near the center and, at the same time, requires a much smaller input than the original image. In this way, a detailed image near the interest point and a coarse representation of the relative position of the surrounding text are obtained: the input to this Slope-MLP is a window of 500 × 250 pixels centered at the point to be classified, downsampled to 50 × 30 values using the fisheye distortion. Two output units with a softmax activation function were used to determine whether or not the current pixel belongs to the lower baseline. After a scan over topologies, two hidden layers of 64 and 16 sigmoid units were used.

Size normalization was achieved by using a second MLP, which classifies the local extrema into five classes (the four reference lines and points not belonging to any of these lines). The input to this Normalize-MLP is the same as the Slope-MLP input, and the output layer consists of five units with a softmax activation function. We used two hidden layers of 64 and 128 sigmoid units. Both MLPs, Slope-MLP and Normalize-MLP, were trained using the stochastic version of the backpropagation algorithm with a momentum term and the cross-entropy error function.

As described in Section 2.3, part of the slant removal process needs an MLP to determine whether or not an image has slant. The same set of 1,000 images was manually slant-corrected in a nonuniform way using a graphical tool: the user specifies a series of slant angles, which are interpolated for every image column. This information is used to train the Slant-MLP. As before, 200 images were used for validation. Each image is sheared at different integer angles from −50 to +50 and resized to 40 pixels in height, preserving the aspect ratio. The input to this Slant-MLP is a square of 40 × 40 pixels centered at the column to be evaluated, and the output is a measure of the local slant presence (shown as gray levels in Fig. 4). After a parameter and topology scan, two hidden layers of 64 and 8 units were used. Training was performed using the stochastic version of the backpropagation algorithm with a momentum term and the mean-square error function.

A word bigram language model was trained with three different text corpora: the LOB corpus [47] (excluding those sentences that contain lines from the test set of the IAM database), the Brown corpus [50], and the Wellington corpus [51]. In order to cope with the fact that lines are fragments of sentences, we have randomly broken each sentence from the corpora into fragments that resemble lines. All of this text is supplemented with the training lines from the IAM database. The final training material is thus comprised of:

• Sentences: 51,560 LOB sentences (2,134 sentences which contained IAM test lines were eliminated), 51,763 Brown sentences, and 20,592 Wellington sentences.
• Fragments of sentences that resemble lines: more than 400,000 lines randomly obtained from the above set of sentences.
• Lines: finally, the 6,161 IAM training lines were also added.
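Returning to the fisheye input described above: the following is a minimal sketch, assuming NumPy, of how a 500 × 250 window can be reduced to 50 × 30 values with dense sampling near the center. Only the window and output sizes come from the text; the function name, the odd-power warp, and the exponent p are hypothetical stand-ins for the paper's actual distortion function.

```python
# Minimal sketch of a fisheye-style downsampling: a large window centered at
# the pixel of interest is reduced to a small grid, sampling densely near the
# center and coarsely toward the edges. The power-law warp is an illustrative
# choice; the paper's exact distortion function may differ.
import numpy as np

def fisheye_window(img: np.ndarray, cy: int, cx: int,
                   win=(250, 500), out=(30, 50), p: float = 2.5) -> np.ndarray:
    """Sample an out[0] x out[1] fisheye view of a win[0] x win[1] window at (cy, cx)."""
    h, w = img.shape
    gy = np.linspace(-1.0, 1.0, out[0])      # normalized output grid, y axis
    gx = np.linspace(-1.0, 1.0, out[1])      # normalized output grid, x axis
    # Odd-power warp: near-zero slope at the center gives fine central
    # resolution; samples spread out quickly toward the periphery.
    wy = np.sign(gy) * np.abs(gy) ** p
    wx = np.sign(gx) * np.abs(gx) ** p
    ys = np.clip(cy + np.round(wy * win[0] / 2).astype(int), 0, h - 1)
    xs = np.clip(cx + np.round(wx * win[1] / 2).astype(int), 0, w - 1)
    return img[np.ix_(ys, xs)]               # shape: (out[0], out[1])
```

Because the warp has near-zero slope at the origin, adjacent output samples map to neighboring input pixels at the center, while samples near the border skip many pixels, matching the "detailed center, coarse surroundings" behavior described above.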
The bigram language model used in the recognition systems was generated using the SRI Language Modeling Toolkit [52] with modified Kneser-Ney back-off discounting. To achieve unconstrained handwriting recognition, an open dictionary, composed of the 20,000 most frequently occurring (case-insensitive) words in the training material, was used to test our recognition systems. The recognition performance was measured in terms of the Word Error Rate (WER), which is computed by comparing the output of the recognizer with the reference transcription. WER is defined as the number of word errors (insertions, substitutions, and deletions) summed over the whole test set and divided by the total number of words in the transcriptions of the reference set (expression (5), reconstructed after this excerpt). A null WER is only reached if the recognizer output matches the reference transcription exactly. The Character Error Rate (CER) was also measured for the final test experiments. CER is defined as expression (5), but with characters instead of words. In order to properly compare different systems, it is highly desirable to provide not only the value of the WER (or CER) but also a confidence ...
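The referenced expression (5) does not survive in this excerpt; the standard WER definition the surrounding text describes is, as a reconstruction (S, D, and I denote the numbers of substituted, deleted, and inserted words, and N the total number of words in the reference transcriptions):

```latex
\mathrm{WER} = \frac{S + D + I}{N} \times 100\%
```

CER follows the same expression with character-level edit operations and with N counted in characters instead of words.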

Similar publications

Article
We introduce a weakly supervised approach for learning human actions modeled as interactions between humans and objects. Our approach is human-centric: We first localize a human in the image and then determine the object relevant for the action and its spatial relation with the human. The model is learned automatically from a set of still images an...

Citations

... Many attempts have been made over the years to develop precise HTR systems. Prior to 2013, many solutions were based on Hidden Markov Models (HMMs) as the prevailing architecture [1][2][3]. From 2013 onwards, however, deep learning models have been considered the standard methodology for offline text recognition. ...
Article
In the realm of offline handwritten text recognition, numerous normalization algorithms have been developed over the years to serve as preprocessing steps prior to applying automatic recognition models to handwritten text scanned images. These algorithms have demonstrated effectiveness in enhancing the overall performance of recognition architectures. However, many of these methods rely heavily on heuristic strategies that are not seamlessly integrated with the recognition architecture itself. This paper introduces the use of a Pix2Pix trainable model, a specific type of conditional generative adversarial network, as the method to normalize handwritten text images. Also, this algorithm can be seamlessly integrated as the initial stage of any deep learning architecture designed for handwritten recognition tasks. All of this facilitates training the normalization and recognition components as a unified whole, while still maintaining some interpretability of each module. Our proposed normalization approach learns from a blend of heuristic transformations applied to text images, aiming to mitigate the impact of intra-personal handwriting variability among different writers. As a result, it achieves slope and slant normalizations, alongside other conventional preprocessing objectives, such as normalizing the size of text ascenders and descenders. We will demonstrate that the proposed architecture replicates, and in certain cases surpasses, the results of a widely used heuristic algorithm across two metrics and when integrated as the first step of a deep recognition architecture.
... Each language model is built with the training transcriptions except for the IAM dataset. For IAM, we opted for a usual setup adopted in previous works [5,21,34,50,54,70,78] where the LM uses the training transcriptions combined with a composition of the corpora Brown [22], Wellington [3], and a filtered version of the LOB [30] corpus where samples matching the IAM test set were removed. Table 10 exhibits the results using the above setup. ...
Article
Off-line handwritten text recognition (HTR) poses a significant challenge due to the complexities of variable handwriting styles, background degradation, and unconstrained word sequences. This work tackles the handwritten text line recognition problem using octave convolutional recurrent neural networks (OctCRNN). Our approach requires no word segmentation, preprocessing, or explicit feature extraction and leverages octave convolutions to process multiscale features without increasing the number of learnable parameters. We thoroughly investigate the OctCRNN under different settings, including an octave design that efficiently balances computational cost and recognition performance, formulating an experimental pipeline with a visualization step to gain intuitions about how the model works compared to a counterpart based on traditional convolutions. The system is completed by adding a language model to increase linguistic knowledge. Finally, we assess the performance of our solution using character and word error rates against established handwritten text recognition benchmarks: IAM, RIMES, and ICFHR 2016 READ. According to the results, our proposal achieves state-of-the-art performance while reducing the computational requirements. Our findings suggest that the architecture provides a robust framework for building HTR systems.
... Motivated by these observations, we propose to model the keystroke transition as a Hidden Markov Model (HMM) [53] and then predict extra input candidates to correct the mapping errors. An HMM models a system whose states are unobservable ("hidden") while an observable process, whose outcomes are influenced by the hidden state, is used to infer it. HMMs have shown success in cases similar to ours, e.g., speech and text modeling [18], [15]. ...
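A minimal sketch of this idea, assuming NumPy: the hidden states are never observed directly, and the forward algorithm scores an observation sequence under transition and emission probabilities. The two-state/two-symbol spaces and all numbers are illustrative assumptions, not taken from the cited work.

```python
# Toy HMM: hidden states are unobservable; observations are used to infer them.
import numpy as np

A = np.array([[0.9, 0.1],    # hidden-state transition probabilities
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],    # emission probabilities: P(observation | state)
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])    # initial state distribution

def forward(obs):
    """Forward algorithm: likelihood of an observation sequence under (A, B, pi)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate, then weight by emission
    return alpha.sum()

print(forward([0, 1, 1]))    # likelihood of a toy observation sequence
```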
... It is unlikely that a single overall winner can attain the best performance in terms of both speed and accuracy for all applications. Attempts to resolve this dilemma have resulted in the development of hybrid systems (6)(7)(8)(9)(10)(11), which usually exploit the complementary strengths of a variety of techniques and combine these different methods in a more organized way. One of the vital factors for a successful image classification system is the classifier. ...
... T_diff is defined as the difference between the first and second largest entries in the ELM output vector. In general, the larger the value of T_diff is, the better the classification boundary tends to be. Note that T_diff was first applied as a criterion in [6] for noisy image partition, where it was shown that clean input images are prone to have a large T_diff in the ELM network output vector. ...
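A minimal sketch of this criterion, assuming NumPy (the function and variable names are illustrative): T_diff is simply the gap between the two largest entries of the classifier's output vector.

```python
import numpy as np

def t_diff(output: np.ndarray) -> float:
    """Difference between the largest and second-largest entries of an output vector."""
    top2 = np.sort(output)[-2:]       # two largest activations, ascending
    return float(top2[1] - top2[0])   # large gap -> more confident decision

# e.g. t_diff(np.array([0.1, 0.7, 0.2])) == 0.5
```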
... Over the past few decades, many scholars have proposed different HTR systems and have made remarkable improvements. For instance, the Hidden Markov Model (HMM) [2] and an HMM-neural network hybrid [3] were implemented to recognize handwritten documents. However, due to the independence assumption of HMM, matching extracted features with labels has limitations, and there is a long-range input problem, even if it is slightly relaxed in the case of HMM-NN hybrid systems. ...
... Additionally, for feature extraction and classification sub-tasks, several machine learning techniques have been proposed, such as HMM, support vector machine (SVM), and neural networks [3,[16][17][18]. Recently, DNN-based models have been introduced and have shown promising results for the high-dimensional automatic feature map extraction and recognition of handwritten texts [8,9,[18][19][20][21][22]. The feature extraction and recognition tasks are performed in an end-to-end manner. ...
Article
Offline handwritten text recognition (HTR) is a long-standing research project for a wide range of applications, including assisting visually impaired users, human-robot interaction, and the automatic entry of business documents. However, due to variations in writing styles, visual similarities between different characters, overlap between characters, and source document noise, designing an accurate and flexible HTR system is challenging. The problem becomes serious when the algorithm has a low learning capacity and when the text used is complex and has a lot of characters in the writing system, such as Ethiopic script. In this paper, we propose a new model that recognizes offline handwritten Ethiopic text using a gated convolution and stacked self-attention encoder–decoder network. The proposed model has a feature extraction layer, an encoder layer, and a decoder layer. The feature extraction layer extracts high-dimensional invariant feature maps from the input handwritten image. Using the extracted feature maps, the encoder and decoder layers transcribe the corresponding text. For the training and testing of the proposed model, we prepare an offline handwritten Ethiopic text-line dataset (HETD) with 2800 samples and a handwritten Ethiopic word dataset (HEWD) with 10,540 samples obtained from 250 volunteers. The experimental results of the proposed model on HETD show a Character Error Rate (CER) of 9.17 and a Word Error Rate (WER) of 13.11, while on HEWD the model shows a CER of 8.22 and a WER of 9.17. These results and the prepared datasets will be used as a baseline for future research.
... This study adds to the body of research concerned with various approaches for text representation such as GloVe, doc2vec, word2vec and BERT (Chen et al., 2022; Phan et al., 2023; Zhang et al., 2023). Moreover, the positive impact of using bigrams as a text representation technique is studied and proven by previous studies focusing on hate speech detection (Abro et al., 2020), handwritten text recognition (España-Boquera et al., 2011) and student dropout prediction based on textual data (Phan et al., 2023). Additionally, the use of the Random Forest (RF) algorithm proves to support prediction and classification explainability as studied by (Ferrettini et al., 2022), which helps understand decisions made by artificial intelligence-based systems (Dennehy et al., 2022). ...
Article
Social media platforms have become an increasingly popular tool for individuals to share their thoughts and opinions with other people. However, very often people tend to misuse social media by posting abusive comments. Abusive and harassing behaviours can have adverse effects on people's lives. This study takes a novel approach to combat harassment in online platforms by detecting the severity of abusive comments, which has not been investigated before. The study compares the performance of machine learning models such as Naïve Bayes, Random Forest, and Support Vector Machine, with deep learning models such as Convolutional Neural Network (CNN) and Bi-directional Long Short-Term Memory (Bi-LSTM). Moreover, in this work we investigate the effect of text pre-processing on the performance of the machine and deep learning models; the feature set for the abusive comments was made using unigrams and bigrams for the machine learning models and word embeddings for the deep learning models. The comparison of the models' performances showed that the Random Forest with bigrams achieved the best overall performance with an accuracy of 0.94, a precision of 0.91, a recall of 0.94, and an F1 score of 0.92. The study develops an efficient model to detect the severity of abusive language in online platforms, offering important implications both to theory and practice.
... When the pre-processing phase is finished, the CNN model is built as the next step. The CNN algorithm has four hidden layers that let it extract information from images so it can forecast the outcome [10]. ...
... For an overview of offline and online HWR datasets, see [29], [42]. For a more detailed overview, see Table 7 in Appendix B. Methods for offline HWR range from hidden Markov models (HMMs), such as [43]-[47], to deep learning techniques that became predominant in 2014, such as the convolutional neural network (CNN) methods of [48], [49]. Furthermore, temporal convolutional networks (TCNs) employ the temporal context of the handwriting, such as the methods of [50], [51]. ...
Article
Cross-modal representation learning learns a shared embedding between two or more modalities to improve performance in a given task compared to using only one of the modalities. Cross-modal representation learning from different data types – such as images and time-series data (e.g., audio or text data) – requires a deep metric learning loss that minimizes the distance between the modality embeddings. In this paper, we propose to use the contrastive or triplet loss, which uses positive and negative identities to create sample pairs with different labels, for cross-modal representation learning between image and time-series modalities (CMR-IS). By adapting the triplet loss for cross-modal representation learning, higher accuracy in the main (time-series classification) task can be achieved by exploiting additional information of the auxiliary (image classification) task. We present a triplet loss with a dynamic margin for single label and sequence-to-sequence classification tasks. We perform extensive evaluations on synthetic image and time-series data, and on data for offline handwriting recognition (HWR) and on online HWR from sensor-enhanced pens for classifying written words. Our experiments show an improved classification accuracy, faster convergence, and better generalizability due to an improved cross-modal representation. Furthermore, the more suitable generalizability leads to a better adaptability between writers for online HWR.
... The trained model's predictions were then visualized using OpenCV. Salvador et al. [8] suggested a hybrid Hidden Markov Model (HMM) approach for recognizing offline handwritten text in unconstrained contexts. In this instance, Markov chains were employed to model the structural part of the optical model, whereas a Multilayer Perceptron was employed to estimate the emission probabilities. ...
Article
Machine learning aims to extract hidden information present in data using knowledge of existing data on a certain subject. By applying specific mathematical functions and concepts to uncover this hidden information, we can predict results for unknown data. Pattern identification is one of ML's primary applications, and large image data sets are typically used to recognize patterns. An example of pattern recognition from an image is handwriting recognition. By employing such notions, we may teach computers to interpret letters and numbers from any language contained in an image. Handwritten characters can be recognized using a variety of techniques. In this project report, we will go over some of these techniques.
... We compared our proposed algorithm with the models of the literature on the IAM and KHATT datasets, and the results obtained are presented in Tables 6 and 7, respectively. Our model reached a CER of 9.0% on the IAM dataset, outperforming a number of traditional models in the current literature, such as the HMM-based models in papers [22] and [20]. The CRNN enhanced with gMLP networks achieved better results than CTC models that encode image features using Multi-Dimensional LSTM (MDLSTM) [27], such as those described in [13, 50]. ...
Article
In this work, we present an efficient approach to the handwritten text recognition (HTR) task. The proposed model combines convolutional and recurrent layers with gMLP networks, trained on sequences of characters rather than words. We evaluate our model on lines of text from popular handwriting benchmark datasets with different languages and distinct sizes of gMLP. The gMLP networks can detect the spatial interaction between the different target characters, and therefore learn a more precise alignment at each step of the decoding. Our model performs well, achieving a CER of 9.0% on the IAM dataset without the help of any lexicon or explicit language model.