Conference Paper

Document image classification using SEMCON


Abstract

In this paper, we propose a new semantic and contextual document image classification framework. The framework is composed of two main modules. The first is the text analysis module (TAM), which processes document images and extracts words from the image; the second is SEMCON, a semantic and contextual objective metric. From the list of words extracted by TAM, SEMCON finds a list of noun terms, applies contextual and semantic analysis to them, and then uses those terms to classify documents. The scope of this paper is limited to the proposed framework and to testing the presented approach on a limited test dataset. Our preliminary results are very promising and suggest that the proposed framework can be used effectively to classify document images.
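A minimal sketch of how such a two-module pipeline could be wired together, assuming pytesseract for the OCR step of TAM and NLTK for part-of-speech tagging; the SEMCON scoring function is passed in as a black box, and all names here are illustrative, not the authors' implementation.

# Sketch only: assumes pytesseract, Pillow, and NLTK (with the punkt and
# averaged_perceptron_tagger data) are installed; not the authors' code.
import pytesseract
from PIL import Image
import nltk

def extract_nouns(image_path):
    # TAM stage (approximated by OCR): read the document image and keep nouns.
    text = pytesseract.image_to_string(Image.open(image_path))
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [word.lower() for word, tag in tagged if tag.startswith("NN")]

def classify(image_path, category_concepts, score_fn):
    # Score each candidate category by relating its concepts to the extracted
    # nouns via a SEMCON-style scoring function score_fn(noun, concept).
    nouns = extract_nouns(image_path)
    scores = {cat: sum(score_fn(n, c) for n in nouns for c in concepts)
              for cat, concepts in category_concepts.items()}
    return max(scores, key=scores.get)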


... Manual annotation and organization is labor intensive, error prone, and time consuming [7]. This raises the demand for effective and efficient structuring and organization, and classification systems provide methods that can properly address this challenging task [8][9][10]. For that reason, in this study we focus on applying machine learning approaches to automate the process of video annotation. ...
Conference Paper
Open educational video resources are gaining popularity with a growing number of massive open online courses (MOOCs). This has created a niche for content providers to adopt effective solutions for automatically organizing and structuring educational resources for maximum visibility. Recent advances in deep learning techniques are proving useful in managing and classifying resources into appropriate categories. This paper proposes one such convolutional neural network (CNN) model for classifying video lectures in a MOOC setting using a transfer learning approach. The model uses time-aligned text transcripts corresponding to video lectures from six broader subject categories. Video lectures and their corresponding transcript dataset are gathered from the Coursera MOOC platform. Two different CNN models are proposed: i) CNN based classification using embeddings learned from our MOOC dataset, ii) CNN based classification using transfer learning. Word embeddings generated from two well-known state-of-the-art pre-trained models, Word2Vec and GloVe, are used in the transfer learning approach for the second case. The proposed CNN models are evaluated using precision, recall, and F1 score, and the obtained performance is compared with both conventional and deep learning classifiers. The proposed CNN models show an F1 score improvement of 10-22 percentage points over DNN and conventional classifiers.
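As an illustration of the transfer-learning variant described above, the following is a minimal Keras sketch in which a frozen embedding layer is initialised from pre-trained Word2Vec or GloVe vectors; vocab_size, embed_dim, num_classes, and the embedding matrix are placeholders, not values from the paper.

# Sketch only: assumes TensorFlow/Keras; not the paper's exact architecture.
import numpy as np
from tensorflow.keras import layers, models, initializers

def build_cnn(vocab_size, embed_dim, num_classes, embedding_matrix):
    model = models.Sequential([
        # Frozen embedding layer initialised from pre-trained vectors.
        layers.Embedding(vocab_size, embed_dim,
                         embeddings_initializer=initializers.Constant(embedding_matrix),
                         trainable=False),
        layers.Conv1D(128, 5, activation="relu"),   # n-gram style convolution filters
        layers.GlobalMaxPooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model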
... We assume that the text documents obtained at this stage are correct since the evaluation of TAM itself is beyond the scope of this paper. Readers are therefore advised to refer to [22] and [23] for full details on the TAM module. ...
Article
Full-text available
This paper provides a comprehensive performance analysis of parametric and non-parametric machine learning classifiers, including a deep feed-forward multi-layer perceptron (MLP) network, on two variants of the improved Concept Vector Space (iCVS) model. In the first variant, a weighting scheme enhanced with the notion of concept importance is used to assess the weight of ontology concepts. Concept importance shows how important a concept is in an ontology and is automatically computed by converting the ontology into a graph and then applying one of the Markov-based algorithms. In the second variant of iCVS, concepts provided by the ontology and their semantically related terms are used to construct concept vectors in order to represent the document in a semantic vector space. We conducted various experiments using a variety of machine learning classifiers for three different models of document representation. The first model is a baseline concept vector space (CVS) model that relies on an exact/partial match technique to represent a document in a vector space. The second and third models are iCVS models that employ an enhanced concept weighting scheme for assessing the weights of concepts (variant 1) and the acquisition of terms that are semantically related to ontology concepts for semantic document representation (variant 2), respectively. Additionally, a comparison between seven different classifiers is performed for all three models using precision, recall, and F1 score. Results for multiple configurations of the deep learning architecture are obtained by varying the number of hidden layers and nodes in each layer, and are compared to those obtained with conventional classifiers. The obtained results show that classification performance is highly dependent upon the choice of classifier, and that Random Forest, Gradient Boosting, and the Multilayer Perceptron are among the classifiers that performed rather well for all three models.
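A minimal sketch of the concept-vector-space representation the two iCVS variants build on: each document is mapped to a vector over ontology concepts, combining exact matches, matches of semantically related terms, and an optional concept-importance weight. The concept list, related-term lists, and importance scores are assumed inputs here, not reproduced from the paper.

# Sketch only: illustrative names, not the paper's implementation.
import numpy as np

def concept_vector(doc_tokens, concepts, related_terms, concept_importance=None):
    # doc_tokens: tokenized document; concepts: ontology concept labels;
    # related_terms: {concept: [semantically related terms]} (variant 2);
    # concept_importance: optional per-concept weight (variant 1).
    vec = np.zeros(len(concepts))
    for i, c in enumerate(concepts):
        hits = doc_tokens.count(c)                                    # exact match
        hits += sum(doc_tokens.count(t) for t in related_terms.get(c, []))
        weight = 1.0 if concept_importance is None else concept_importance[i]
        vec[i] = weight * hits
    return vec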
Article
Full-text available
The main contribution of this paper is a new method for classifying document images by combining textual features extracted with the Bag of Words (BoW) technique and visual features extracted with the Bag of Visual Words (BoVW) technique. The BoVW is widely used within the computer vision community for scene classification or object recognition, but few applications to the classification of entire document images have been reported. While previous attempts have shown disappointing results by combining visual and textual features with the Borda-count technique, we propose here a combination through a learning approach. Experiments conducted on an industrial database of 1925 document images reveal that this fusion scheme significantly improves classification performance. Our concluding contribution deals with the choice and tuning of the BoW and/or BoVW techniques in an industrial context.
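The fusion-through-learning idea above can be sketched roughly as follows: textual TF-IDF features and precomputed BoVW histograms are concatenated into one feature space and a single classifier is trained on it, instead of Borda-count fusion of separate rankings. Library choices and parameters here are illustrative assumptions.

# Sketch only: assumes scikit-learn and a precomputed visual codebook.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def train_fused_classifier(ocr_texts, bovw_histograms, labels):
    # Textual features from OCR text, visual features from bag-of-visual-words
    # histograms; both are concatenated before learning a single classifier.
    tfidf = TfidfVectorizer(max_features=5000)
    X_text = tfidf.fit_transform(ocr_texts).toarray()
    X = np.hstack([X_text, np.asarray(bovw_histograms, dtype=float)])
    clf = LinearSVC().fit(X, labels)
    return tfidf, clf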
Conference Paper
Full-text available
Domain ontologies are a good starting point for modeling in a formal way the basic vocabulary of a given domain. However, in order for an ontology to be usable in real applications, it has to be supplemented with lexical resources of this particular domain. The learning process for enriching domain ontologies with new lexical resources employed in existing approaches takes into account only the contextual aspects of terms and does not consider their semantics. Therefore, this paper proposes a new objective metric, namely SEMCON, which combines contextual as well as semantic information of terms for enriching the domain ontology with new concepts. SEMCON defines the context by first computing an observation matrix that exploits statistical features such as the frequency of occurrence of a term and the term's font type and font size. The semantics is then incorporated by computing a semantic similarity score using the lexical database WordNet. Subjective and objective experiments are conducted, and results show an improved performance of SEMCON compared with tf*idf and \(\chi ^{2}\).
Conference Paper
Full-text available
This paper proposes a new objective metric called SEMCON to enrich existing concepts in domain ontologies for describing and organizing multimedia documents. The SEMCON model exploits the document contextually and semantically. The preprocessing module collects a document and partitions it into several passages. A morpho-syntactic analysis is then performed on the partitioned passages and a list of nouns as part-of-speech (POS) is extracted. An observation matrix based on statistical features is then computed, followed by computing the contextual score. The semantics is then incorporated by computing a semantic similarity score between two terms: a term (noun) extracted from the document and a term that already exists in the ontology as a concept. Eventually, an overall objective score is computed by adding the contextual score to the semantic score. Subjective experiments are conducted to evaluate the performance of the SEMCON model. The model is compared with the state-of-the-art tf*idf and Chi square using the F1 measure. The experimental results show that SEMCON achieved an improved accuracy of 10.64 % over tf*idf and 13.04 % over Chi square.
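A rough sketch of SEMCON-style scoring as described in these abstracts, assuming NLTK's WordNet interface for the semantic part; the observation matrix is reduced here to per-passage term counts, whereas the papers also use font type and font size, and the exact weighting is not reproduced.

# Sketch only: assumes NLTK with the WordNet corpus installed.
import numpy as np
from nltk.corpus import wordnet as wn

def contextual_score(term, passages):
    # One column of the observation matrix: occurrences of the term per passage.
    counts = np.array([p.lower().split().count(term.lower()) for p in passages],
                      dtype=float)
    return counts.sum() / max(len(passages), 1)

def semantic_score(term, concept):
    # Best Wu-Palmer similarity over noun synsets of the two words (WordNet).
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wn.synsets(term, pos=wn.NOUN)
              for s2 in wn.synsets(concept, pos=wn.NOUN)]
    return max(scores, default=0.0)

def semcon_score(term, concept, passages):
    # Overall objective score: contextual score plus semantic score.
    return contextual_score(term, passages) + semantic_score(term, concept)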
Conference Paper
Full-text available
This paper will focus on the semantic representation of verbs in computer systems and its impact on lexical selection problems in machine translation (MT). Two groups of English and Chinese verbs are examined to show that lexical selection must be based on interpretation of the sentences as well as selection restrictions placed on the verb arguments. A novel representation scheme is suggested, and is compared to representations with selection restrictions used in transfer-based MT. We see our approach as closely aligned with knowledge-based MT approaches (KBMT), and as a separate component that could be incorporated into existing systems. Examples and experimental results will show that, using this scheme, inexact matches can achieve correct lexical selection.
Article
Full-text available
This paper proposes a reference-free perceptual quality metric for blackboard lecture images. The text in the image is mostly affected by high compression ratios and de-noising filters, which cause blocking and blurring artifacts. As a result, the perceived text quality of the blackboard image degrades. The degraded text is not only difficult for humans to read but also makes the optical character recognition task even more difficult. Therefore, we first put our effort into estimating the presence of these artifacts and then used them in our proposed quality metric. The blocking and blurring features are extracted from the image content on block boundaries without the presence of a reference image, which makes our metric reference free. The metric also uses a visual saliency model to mimic the human visual system (HVS) by focusing only on the distortions in perceptually important regions, i.e. those regions which contain the text. Moreover, psychophysical experiments are conducted that show very good correlation between the mean opinion score and the quality scores obtained from our reference-free perceptual quality metric (RF-PQM). The correlation results are also compared with standard reference and reference-free metrics.
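A simplified blockiness feature in the spirit of the artifact estimation above: it compares gradient energy across 8x8 block boundaries with gradient energy inside blocks. The saliency weighting and the blur feature of RF-PQM are not reproduced; this is an illustrative sketch only.

# Sketch only: crude blockiness ratio, not the RF-PQM metric itself.
import numpy as np

def blockiness(img, block=8):
    img = img.astype(float)
    dh = np.abs(np.diff(img, axis=1))            # horizontal pixel differences
    boundary = dh[:, block - 1::block].mean()    # differences across block edges
    mask = np.ones(dh.shape[1], dtype=bool)
    mask[block - 1::block] = False
    interior = dh[:, mask].mean()                # differences inside blocks
    return boundary / (interior + 1e-8)          # >1 suggests blocking artifacts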
Article
Full-text available
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
Article
Full-text available
Document image classification is an important step in Office Automation, Digital Libraries, and other document image analysis applications. There is great diversity in document image classifiers: they differ in the problems they solve, in the use of training data to construct class models, and in the choice of document features and classification algorithms. We survey this diverse literature using three components: the problem statement, the classifier architecture, and performance evaluation. This brings to light important issues in designing a document classifier, including the definition of document classes, the choice of document features and feature representation, and the choice of classification algorithm and learning mechanism. We emphasize techniques that classify single-page typeset document images without using OCR results. Developing a general, adaptable, high-performance classifier is challenging due to the great variety of documents, the diverse criteria used to define document classes, and the ambiguity that arises due to ill-defined or fuzzy document classes.
Conference Paper
In this paper we present a multipage administrative document image retrieval system based on textual and visual representations of document pages. Individual pages are represented by textual or visual information using a bag-of-words framework. Different fusion strategies are evaluated which allow the system to perform multipage document retrieval on the basis of a single page retrieval system. Results are reported on a large dataset of document images sampled from a banking workflow.
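A minimal sketch of score-level fusion for multipage retrieval as described above: each page is scored by a single-page retrieval function, and per-document scores are obtained by max or mean fusion over its pages; the actual fusion strategies evaluated in the paper may differ.

# Sketch only: illustrative fusion of per-page retrieval scores.
import numpy as np

def fuse_document_scores(page_scores_per_doc, strategy="max"):
    # page_scores_per_doc: {doc_id: [similarity of each page to the query]}
    fuse = np.max if strategy == "max" else np.mean
    return {doc: float(fuse(scores))
            for doc, scores in page_scores_per_doc.items()}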
Conference Paper
We present a system for automatic FAX routing which processes incoming FAX images and forwards them to the correct email alias. The system first performs optical character recognition to find words and in some cases parts of words (we have observed error rates as high as 10 to 20 percent). For all these "noisy" words, a set of features is computed which includes internal text features, location features, and relationship features. These features are combined to estimate the relevance of the word in the context of the page and the recipient database. The parameters of the word relevance function are learned from training data using the AdaBoost learning algorithm. Words are then compared to the database of recipients to find likely matches. The recipients are finally ranked by combining the quality of the matches and the relevance of the words. Experiments are presented which demonstrate the effectiveness of this system on a large set of real data.
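A rough sketch of the word-relevance step using scikit-learn's AdaBoost, assuming the per-word features (internal text, location, and relationship features) are precomputed; the feature design and the recipient-matching heuristic here are illustrative, not the paper's.

# Sketch only: relevance learning plus a naive recipient-ranking heuristic.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def train_word_relevance(word_feature_matrix, relevance_labels):
    # One feature row per OCR'd word; label 1 if the word mattered for routing.
    return AdaBoostClassifier(n_estimators=100).fit(word_feature_matrix,
                                                    relevance_labels)

def rank_recipients(words, word_feature_matrix, recipients, clf):
    relevance = clf.predict_proba(word_feature_matrix)[:, 1]
    # Score each recipient by the total relevance of words matching their entry.
    def score(recipient):
        return sum(rel for w, rel in zip(words, relevance)
                   if w.lower() in recipient.lower())
    return sorted(recipients, key=score, reverse=True)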
Conference Paper
One of the most useful thresholding techniques using the gray-level histogram of an image is the Otsu method. The objective of this paper is to extend it to the 2-dimensional histogram. The 2-dimensional Otsu method utilizes the gray-level information of each pixel and its spatial correlation information within the neighborhood. This method was compared with the 1-dimensional Otsu method. It was found that the proposed method performs much better when the images are corrupted by noise.
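A small sketch of the 2-dimensional Otsu idea described above: each pixel is paired with its 3x3 neighborhood mean, a 2D histogram is built, and the threshold pair maximizing the trace of the between-class scatter (background vs. object quadrants) is selected. The implementation details follow the standard two-class approximation and are an assumption, not taken from the paper.

# Sketch only: exhaustive 2D Otsu search; assumes NumPy and SciPy.
import numpy as np
from scipy.ndimage import uniform_filter

def otsu_2d(img, bins=256):
    # Pair each pixel's gray level with its 3x3 neighborhood mean.
    mean_img = uniform_filter(img.astype(float), size=3)
    hist, _, _ = np.histogram2d(img.ravel(), mean_img.ravel(),
                                bins=bins, range=[[0, 256], [0, 256]])
    p = hist / hist.sum()

    i = np.arange(bins)
    P = p.cumsum(axis=0).cumsum(axis=1)                   # mass of quadrant [0..s, 0..t]
    Mi = (p * i[:, None]).cumsum(axis=0).cumsum(axis=1)   # first moment, gray axis
    Mj = (p * i[None, :]).cumsum(axis=0).cumsum(axis=1)   # first moment, mean axis
    mu_i, mu_j = Mi[-1, -1], Mj[-1, -1]                   # global means

    best, best_st = -1.0, (0, 0)
    for s in range(bins):
        for t in range(bins):
            w0 = P[s, t]
            if w0 <= 0.0 or w0 >= 1.0:
                continue
            mu0_i, mu0_j = Mi[s, t] / w0, Mj[s, t] / w0
            # Trace of the between-class scatter for the two-class approximation.
            tr = w0 / (1.0 - w0) * ((mu0_i - mu_i) ** 2 + (mu0_j - mu_j) ** 2)
            if tr > best:
                best, best_st = tr, (s, t)
    return best_st  # (gray-level threshold, neighborhood-mean threshold)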