Fisheye lens example (from top to bottom): original image of 500 × 250 pixels centered at the pixel to be classified; the same image with a fisheye lens distortion; the image downsampled to 50 × 30 values used by the neural network classifier.

Source publication
Article
This paper proposes the use of hybrid Hidden Markov Model (HMM)/Artificial Neural Network (ANN) models for recognizing unconstrained offline handwritten texts. The structural part of the optical models has been modeled with Markov chains, and a Multilayer Perceptron is used to estimate the emission probabilities. This paper also presents new techni...

Context in source publication

Context 1
... experiments reported in this paper are conducted on handwritten text lines from the IAM database [32]. Version 3.0 of this database includes over 1,500 scanned forms of handwritten text from more than 650 different writers, for a total of more than 13,000 fully transcribed handwritten lines, without restrictions on the writing style or the writing instrument used. The sentences have been extracted from the Lancaster-Oslo/Bergen (LOB) text corpus [47]. A writer-independent text line recognition task has been considered. The subset of the IAM database used in this work consists of 6,161 training lines (from 283 writers), 920 validation lines (56 writers), and 2,781 test lines (161 writers). All of these data sets are disjoint, and no writer has contributed to more than one set. These partitions are the same as those used in several works by Bunke et al. [33], [48], [49]. A total of 87,967 instances of 11,320 distinct words occur in the union of the training, validation, and test sets. The lexicon is modeled with 78 characters: 26 lowercase letters, 26 uppercase letters, 10 digits, 14 punctuation marks, the space, and a character for garbage symbols.

As described in Section 2.1, an MLP has been used for image cleaning by learning the appropriate filter from examples. Original noisy images from the IAM database and the same images cleaned by hand formed the training pairs. Additionally, artificially noised images (created by following the ideas presented in [34]) were also used as training data. In this case, the MLP is used for regression: the input is a fixed-size moving window of 11 × 11 pixels centered at the pixel to be cleaned, and the output is the restored value of the current pixel (see Fig. 1). The Enhancer-MLP has two hidden layers of 32 and 16 sigmoid units and one linear output unit; a sketch of this topology is given after this passage. Training was performed using the stochastic version of the backpropagation algorithm with a momentum term [46] and the mean-square error (MSE) function. The last column of Table 1 shows the topology of the MLPs used for preprocessing.

As pointed out in Sections 2.2 and 2.4, the slope removal and image size normalization processes require two MLPs that classify local extrema as belonging to one of the reference lines (lower line, upper line, line of descenders, or line of ascenders). Supervised training patterns were needed to train these MLPs, so a subset of 1,000 images from the IAM training set has been used. Local extrema of the 1,000 images were semi-automatically labeled using an active learning approach: first, a horizontal projection algorithm was used to classify the points belonging to each reference line of a subset of the 1,000 images; second, this subset was manually corrected using a graphical tool designed for this purpose [12]; third, these images were used to train an MLP to classify interest points. With this “pretrained” MLP, interest points of the 1,000 images were automatically classified and afterward manually supervised. At the end of this process, we had a training set composed of the interest points of the 1,000 images: 800 lines were used as training data and the remaining 200 lines as validation data. The Slope-MLP was trained to classify local extrema as belonging or not belonging to the lower baseline.
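As a concrete illustration of the Enhancer-MLP topology just described, here is a minimal sketch, assuming PyTorch. The layer sizes (a flattened 11 × 11 input window, 32 and 16 sigmoid hidden units, one linear output) and the MSE/momentum training setup come from the text; the class name and the learning-rate and momentum values are illustrative assumptions.

```python
# Hedged sketch of the Enhancer-MLP described above (assuming PyTorch).
# Layer sizes and the MSE/momentum setup follow the text; lr and momentum
# values are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class EnhancerMLP(nn.Module):
    def __init__(self, window: int = 11):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(window * window, 32),  # flattened 11x11 moving window
            nn.Sigmoid(),
            nn.Linear(32, 16),
            nn.Sigmoid(),
            nn.Linear(16, 1),                # linear output: restored pixel value
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = EnhancerMLP()
criterion = nn.MSELoss()  # mean-square error, as in the text
# "Stochastic backpropagation with a momentum term" maps to SGD with momentum.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```

A training loop would then present (noisy window, clean pixel) pairs, matching the regression setup described above.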
The Slope-MLP input is a moving window around the current pixel, the choice of an appropriate window size being a trade-off between context and input size. To partially overcome this problem, we have opted to use a fisheye distortion centered at the pixel to be classified (see Fig. 6 for an example) [12]; a sketch of such a resampling is given after this passage. The fisheye distortion maintains a very accurate resolution near the center and, at the same time, requires a much smaller input than the original image. In this way, a detailed image near the interest point and a coarse representation of the relative position of the surrounding text are obtained: the input to this Slope-MLP is a window of 500 × 250 pixels centered at the point to be classified, downsampled to 50 × 30 values using the fisheye distortion. Two output units with a softmax activation function were used to determine whether or not the current pixel belongs to the lower baseline. After a scan over topologies, two hidden layers of 64 and 16 sigmoid units were used.

Size normalization was achieved by using a second MLP, which classifies the local extrema into five classes (the four reference lines and points not belonging to any of these lines). The input to this Normalize-MLP is the same as the Slope-MLP input, and the output layer consists of five units with a softmax activation function. We used two hidden layers of 64 and 128 sigmoid units. Both MLPs, Slope-MLP and Normalize-MLP, were trained using the stochastic version of the backpropagation algorithm with a momentum term and the cross-entropy error function.

As described in Section 2.3, part of the slant removal process needs an MLP to determine whether or not an image has slant. The same set of 1,000 images was manually slant-corrected in a nonuniform way using a graphical tool: the user specifies a series of slant angles, which are interpolated for every image column. This information is used to train the Slant-MLP. As before, 200 images were used for validation. Each image is sheared at different integer angles from −50 to +50 and resized to 40 pixels in height, preserving the aspect ratio. The input to this Slant-MLP is a square of 40 × 40 pixels centered at the column to be evaluated, and the output is a measure of the local slant presence (shown as gray levels in Fig. 4). After a parameter and topology scan, two hidden layers of 64 and 8 units were used. Training was performed using the stochastic version of the backpropagation algorithm with a momentum term and the mean-square error function.

A word bigram language model was trained with three different text corpora: the LOB corpus [47] (excluding those sentences that contain lines from the test set of the IAM database), the Brown corpus [50], and the Wellington corpus [51]. In order to cope with the fact that lines are fragments of sentences, we have randomly broken each sentence from the corpora into fragments that resemble lines. All of this text is supplemented with the training lines from the IAM database. The final training material is thus comprised of:

• Sentences: 51,560 LOB sentences (2,134 sentences which contained IAM test lines were eliminated), 51,763 Brown sentences, and 20,592 Wellington sentences.
• Fragments of sentences that resemble lines: more than 400,000 lines randomly obtained from the above set of sentences.
• Lines: finally, the 6,161 IAM training lines were also added.
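Returning to the fisheye input described above: the following is a minimal sketch, assuming NumPy, of how a 500 × 250 window can be reduced to 50 × 30 values with dense sampling near the center. Only the window and output sizes come from the text; the function name, the odd-power warp, and the exponent p are hypothetical stand-ins for the paper's actual distortion function.

```python
# Minimal sketch of a fisheye-style downsampling: a large window centered at
# the pixel of interest is reduced to a small grid, sampling densely near the
# center and coarsely toward the edges. The power-law warp is an illustrative
# choice; the paper's exact distortion function may differ.
import numpy as np

def fisheye_window(img: np.ndarray, cy: int, cx: int,
                   win=(250, 500), out=(30, 50), p: float = 2.5) -> np.ndarray:
    """Sample an out[0] x out[1] fisheye view of a win[0] x win[1] window at (cy, cx)."""
    h, w = img.shape
    gy = np.linspace(-1.0, 1.0, out[0])      # normalized output grid, y axis
    gx = np.linspace(-1.0, 1.0, out[1])      # normalized output grid, x axis
    # Odd-power warp: near-zero slope at the center gives fine central
    # resolution; samples spread out quickly toward the periphery.
    wy = np.sign(gy) * np.abs(gy) ** p
    wx = np.sign(gx) * np.abs(gx) ** p
    ys = np.clip(cy + np.round(wy * win[0] / 2).astype(int), 0, h - 1)
    xs = np.clip(cx + np.round(wx * win[1] / 2).astype(int), 0, w - 1)
    return img[np.ix_(ys, xs)]               # shape: (out[0], out[1])
```

Because the warp has near-zero slope at the origin, adjacent output samples map to neighboring input pixels at the center, while samples near the border skip many pixels, matching the "detailed center, coarse surroundings" behavior described above.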
The bigram language model used in the recognition systems was generated using the SRI Language Modeling Toolkit [52] with modified Kneser-Ney back-off discounting. To achieve unconstrained handwriting recognition, an open dictionary, composed of the 20,000 most frequently occurring (case-insensitive) words in the training material, was used to test our recognition systems. The recognition performance was measured in terms of the Word Error Rate (WER), which is computed by comparing the output of the recognizer with the reference transcription. WER is defined as the number of word errors (insertions, substitutions, and deletions) summed over the whole test set and divided by the total number of words in the transcriptions of the reference set (expression (5), reconstructed after this excerpt). A null WER is only reached if the recognizer output matches the reference transcription exactly. The Character Error Rate (CER) was also measured for the final test experiments. CER is defined as expression (5), but with characters instead of words. In order to properly compare different systems, it is highly desirable to provide not only the value of the WER (or CER) but also a confidence ...
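The referenced expression (5) does not survive in this excerpt; the standard WER definition the surrounding text describes is, as a reconstruction (S, D, and I denote the numbers of substituted, deleted, and inserted words, and N the total number of words in the reference transcriptions):

```latex
\mathrm{WER} = \frac{S + D + I}{N} \times 100\%
```

CER follows the same expression with character-level edit operations and with N counted in characters instead of words.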

Similar publications

Article
We introduce a weakly supervised approach for learning human actions modeled as interactions between humans and objects. Our approach is human-centric: We first localize a human in the image and then determine the object relevant for the action and its spatial relation with the human. The model is learned automatically from a set of still images an...

Citations

... Many attempts have been made over the years to develop precise HTR systems. Prior to 2013, many solutions were based on Hidden Markov Models (HMMs) as the prevailing architecture [1][2][3]. From 2013 onwards, however, deep learning models have been considered the standard methodology for offline text recognition. ...
Article
In the realm of offline handwritten text recognition, numerous normalization algorithms have been developed over the years to serve as preprocessing steps prior to applying automatic recognition models to handwritten text scanned images. These algorithms have demonstrated effectiveness in enhancing the overall performance of recognition architectures. However, many of these methods rely heavily on heuristic strategies that are not seamlessly integrated with the recognition architecture itself. This paper introduces the use of a Pix2Pix trainable model, a specific type of conditional generative adversarial network, as the method to normalize handwritten text images. Also, this algorithm can be seamlessly integrated as the initial stage of any deep learning architecture designed for handwritten recognition tasks. All of this facilitates training the normalization and recognition components as a unified whole, while still maintaining some interpretability of each module. Our proposed normalization approach learns from a blend of heuristic transformations applied to text images, aiming to mitigate the impact of intra-personal handwriting variability among different writers. As a result, it achieves slope and slant normalizations, alongside other conventional preprocessing objectives, such as normalizing the size of text ascenders and descenders. We will demonstrate that the proposed architecture replicates, and in certain cases surpasses, the results of a widely used heuristic algorithm across two metrics and when integrated as the first step of a deep recognition architecture.
... Each language model is built with the training transcriptions except for the IAM dataset. For IAM, we opted for a usual setup adopted in previous works [5,21,34,50,54,70,78] where the LM uses the training transcriptions combined with a composition of the corpora Brown [22], Wellington [3], and a filtered version of the LOB [30] corpus where samples matching the IAM test set were removed. Table 10 exhibits the results using the above setup. ...
Article
Off-line handwritten text recognition (HTR) poses a significant challenge due to the complexities of variable handwriting styles, background degradation, and unconstrained word sequences. This work tackles the handwritten text line recognition problem using octave convolutional recurrent neural networks (OctCRNN). Our approach requires no word segmentation, preprocessing, or explicit feature extraction and leverages octave convolutions to process multiscale features without increasing the number of learnable parameters. We thoroughly investigate the OctCRNN under different settings, including an octave design that efficiently balances computational cost and recognition performance, formulating an experimental pipeline with a visualization step to gain intuitions about how the model works compared to a counterpart based on traditional convolutions. The system is completed by adding a language model to increase linguistic knowledge. Finally, we assess the performance of our solution using character and word error rates against established handwritten text recognition benchmarks: IAM, RIMES, and ICFHR 2016 READ. According to the results, our proposal achieves state-of-the-art performance while reducing the computational requirements. Our findings suggest that the architecture provides a robust framework for building HTR systems.
... Motivated by these observations, we propose to model the keystroke transition as a Hidden Markov Model (HMM) [53] and then predict extra input candidates to correct the mapping errors. An HMM models a system whose states are unobservable ("hidden") while an observable process, whose outcomes are influenced by the hidden state, is used to infer it. HMMs have shown success in cases similar to ours, e.g., speech and text modeling [18], [15]. ...
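A minimal sketch of this idea, assuming NumPy: the hidden states are never observed directly, and the forward algorithm scores an observation sequence under transition and emission probabilities. The two-state/two-symbol spaces and all numbers are illustrative assumptions, not taken from the cited work.

```python
# Toy HMM: hidden states are unobservable; observations are used to infer them.
import numpy as np

A = np.array([[0.9, 0.1],    # hidden-state transition probabilities
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],    # emission probabilities: P(observation | state)
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])    # initial state distribution

def forward(obs):
    """Forward algorithm: likelihood of an observation sequence under (A, B, pi)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate, then weight by emission
    return alpha.sum()

print(forward([0, 1, 1]))    # likelihood of a toy observation sequence
```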
... It is unlikely that a single overall winner can attain the best performance in terms of both speed and accuracy for all applications. Attempts to resolve this dilemma have resulted in the development of hybrid systems (6)(7)(8)(9)(10)(11), which usually exploit the complementary strengths of a variety of techniques and combine these different methods in a more organized way. One of the vital factors for a successful image classification system is the classifier. ...
... T_diff is defined as the difference between the first and second largest entries in the ELM output vector. In general, the larger the value of T_diff is, the better the classification boundary tends to be. Note that T_diff was first applied as a criterion in [6] for noisy image partition, where it was shown that clean input images are prone to have a large T_diff in the ELM network output vector. ...
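A minimal sketch of this criterion, assuming NumPy (the function and variable names are illustrative): T_diff is simply the gap between the two largest entries of the classifier's output vector.

```python
import numpy as np

def t_diff(output: np.ndarray) -> float:
    """Difference between the largest and second-largest entries of an output vector."""
    top2 = np.sort(output)[-2:]       # two largest activations, ascending
    return float(top2[1] - top2[0])   # large gap -> more confident decision

# e.g. t_diff(np.array([0.1, 0.7, 0.2])) == 0.5
```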
... Over the past few decades, many scholars have proposed different HTR systems and have made remarkable improvements. For instance, the Hidden Markov Model (HMM) [2] and an HMM-neural network hybrid [3] were implemented to recognize handwritten documents. However, due to the independence assumption of HMM, matching extracted features with labels has limitations, and there is a long-range input problem, even if it is slightly relaxed in the case of HMM-NN hybrid systems. ...
... Additionally, for feature extraction and classification sub-tasks, several machine learning techniques have been proposed, such as HMM, support vector machine (SVM), and neural networks [3,[16][17][18]. Recently, DNN-based models have been introduced and have shown promising results for the high-dimensional automatic feature map extraction and recognition of handwritten texts [8,9,[18][19][20][21][22]. The feature extraction and recognition tasks are performed in an end-to-end manner. ...
Article
Offline handwritten text recognition (HTR) is a long-standing research project for a wide range of applications, including assisting visually impaired users, human-robot interaction, and the automatic entry of business documents. However, due to variations in writing styles, visual similarities between different characters, overlap between characters, and source document noise, designing an accurate and flexible HTR system is challenging. The problem becomes serious when the algorithm has a low learning capacity and when the text used is complex and has a lot of characters in the writing system, such as Ethiopic script. In this paper, we propose a new model that recognizes offline handwritten Ethiopic text using a gated convolution and stacked self-attention encoder–decoder network. The proposed model has a feature extraction layer, an encoder layer, and a decoder layer. The feature extraction layer extracts high-dimensional invariant feature maps from the input handwritten image. Using the extracted feature maps, the encoder and decoder layers transcribe the corresponding text. For the training and testing of the proposed model, we prepare an offline handwritten Ethiopic text-line dataset (HETD) with 2800 samples and a handwritten Ethiopic word dataset (HEWD) with 10,540 samples obtained from 250 volunteers. The experimental results of the proposed model on HETD show a Character Error Rate (CER) of 9.17 and a Word Error Rate (WER) of 13.11, while on HEWD the model shows a CER of 8.22 and a WER of 9.17. These results and the prepared datasets will be used as a baseline for future research.
... This study adds to the body of research concerned with various approaches for text representation such as GloVe, doc2vec, word2vec and BERT (Chen et al., 2022; Phan et al., 2023; Zhang et al., 2023). Moreover, the positive impact of using bigrams as a text representation technique is studied and proven by previous studies focusing on hate speech detection (Abro et al., 2020), handwritten text recognition (España-Boquera et al., 2011) and student dropout prediction based on textual data (Phan et al., 2023). Additionally, the use of the Random Forest (RF) algorithm proves to support prediction and classification explainability as studied by (Ferrettini et al., 2022), which helps understand decisions made by artificial intelligence-based systems (Dennehy et al., 2022). ...
Article
Social media platforms have become an increasingly popular tool for individuals to share their thoughts and opinions with other people. However, very often people tend to misuse social media by posting abusive comments. Abusive and harassing behaviours can have adverse effects on people's lives. This study takes a novel approach to combat harassment in online platforms by detecting the severity of abusive comments, which has not been investigated before. The study compares the performance of machine learning models such as Naïve Bayes, Random Forest, and Support Vector Machine, with deep learning models such as Convolutional Neural Network (CNN) and Bi-directional Long Short-Term Memory (Bi-LSTM). Moreover, in this work we investigate the effect of text pre-processing on the performance of the machine and deep learning models; the feature set for the abusive comments was made using unigrams and bigrams for the machine learning models and word embeddings for the deep learning models. The comparison of the models' performances showed that the Random Forest with bigrams achieved the best overall performance with an accuracy of 0.94, a precision of 0.91, a recall of 0.94, and an F1 score of 0.92. The study develops an efficient model to detect the severity of abusive language in online platforms, offering important implications both to theory and practice.
... When the pre-processing phase is finished, the CNN model is built as the next step. The CNN algorithm has four hidden layers that let it extract information from images so it can forecast the outcome [10]. ...
... For an overview of offline and online HWR datasets, see [29], [42]. For a more detailed overview, see Table 7 in Appendix B. Methods for offline HWR range from hidden Markov models (HMMs), such as [43]-[47], to deep learning techniques that became predominant in 2014, such as the convolutional neural network (CNN) methods of [48], [49]. Furthermore, temporal convolutional networks (TCNs) employ the temporal context of the handwriting, such as the methods of [50], [51]. ...
Article
Cross-modal representation learning learns a shared embedding between two or more modalities to improve performance in a given task compared to using only one of the modalities. Cross-modal representation learning from different data types – such as images and time-series data (e.g., audio or text data) – requires a deep metric learning loss that minimizes the distance between the modality embeddings. In this paper, we propose to use the contrastive or triplet loss, which uses positive and negative identities to create sample pairs with different labels, for cross-modal representation learning between image and time-series modalities (CMR-IS). By adapting the triplet loss for cross-modal representation learning, higher accuracy in the main (time-series classification) task can be achieved by exploiting additional information of the auxiliary (image classification) task. We present a triplet loss with a dynamic margin for single label and sequence-to-sequence classification tasks. We perform extensive evaluations on synthetic image and time-series data, and on data for offline handwriting recognition (HWR) and on online HWR from sensor-enhanced pens for classifying written words. Our experiments show an improved classification accuracy, faster convergence, and better generalizability due to an improved cross-modal representation. Furthermore, the more suitable generalizability leads to a better adaptability between writers for online HWR.
... The trained model's predictions were then visualized using OpenCV. Salvador et al. [8] suggested a hybrid Hidden Markov Model (HMM) approach for recognizing offline handwritten text in unconstrained contexts. In this instance, Markov chains were employed to model the structural part of the optical model, whereas a Multilayer Perceptron was employed to estimate the emission probabilities. ...
Article
Machine learning aims to extract hidden information present in data using knowledge of existing data on a certain subject. By applying specific mathematical functions and concepts to uncover this hidden information, we can predict results for unknown data. Pattern identification is one of ML's primary applications, and large image data sets are typically used to recognize patterns. An example of pattern recognition from an image is handwriting recognition. By employing such notions, we may teach computers to interpret letters and numbers from any language contained in an image. Handwritten characters can be recognized using a variety of techniques. In this project report, we will go over some of these techniques.
... We compared our proposed algorithm with the models of the literature on the IAM and KHATT datasets, and the results obtained are presented in Tables 6 and 7, respectively. Our model reached a CER of 9.0% on the IAM dataset, outperforming a number of traditional models in the current literature, such as the HMM-based models in papers [22] and [20]. The CRNN enhanced with gMLP networks achieved better results than CTC models that encode image features using Multi-Dimensional LSTM (MDLSTM) [27], such as those described in [13, 50]. ...
Article
In this work, we present an efficient approach to the handwritten text recognition (HTR) task. The proposed model combines convolutional and recurrent layers with gMLP networks, trained on sequences of characters rather than words. We evaluate our model on lines of text from popular handwriting benchmark datasets with different languages and distinct sizes of gMLP. The gMLP networks can detect the spatial interaction between the different target characters, and therefore learn a more precise alignment at each step of the decoding. Our model performs well, achieving a CER of 9.0% on the IAM dataset without the help of any lexicon or explicit language model.