Figure 1 - uploaded by Robert Martin Haralick
The setup while photocopying a thick, bound document. The center of perspectivity is at O, which is also the origin of the coordinate frame.


Source publication
Conference Paper
Full-text available
Two sources of document degradation are modeled: i) perspective distortion that occurs while photocopying or scanning thick, bound documents, and ii) degradation due to perturbations in the optical scanning and digitization process: speckle, blur, jitter, and thresholding. Perspective distortion is modeled by studying the underlying perspective geomet...
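The second degradation source is commonly approximated in later work by a bit-flipping model in which the probability of flipping a pixel decays with its distance to the nearest pixel of the opposite colour, followed by a morphological closing. Below is a minimal NumPy/SciPy sketch of that idea; the parameter names (alpha0, alpha, beta0, beta, eta, disk_radius) follow the usual description of the model and their default values are illustrative, not the ones from the paper.

```python
import numpy as np
from scipy import ndimage

def kanungo_noise(binary, alpha0=1.0, alpha=1.0, beta0=1.0, beta=1.0,
                  eta=0.0, disk_radius=1, rng=None):
    """Degrade a binary image (1 = foreground ink, 0 = background).

    Flip probability decays with the distance to the nearest pixel of the
    opposite colour; a morphological closing then smooths the result.
    Default parameter values are illustrative only.
    """
    rng = np.random.default_rng() if rng is None else rng
    fg = binary.astype(bool)

    # Distance of each foreground pixel to the background, and vice versa.
    d_fg = ndimage.distance_transform_edt(fg)    # distance to nearest 0
    d_bg = ndimage.distance_transform_edt(~fg)   # distance to nearest 1

    p_flip = np.where(fg,
                      alpha0 * np.exp(-alpha * d_fg ** 2) + eta,
                      beta0 * np.exp(-beta * d_bg ** 2) + eta)
    flipped = rng.random(binary.shape) < p_flip
    noisy = np.where(flipped, ~fg, fg)

    # Morphological closing with a disk-shaped structuring element.
    y, x = np.ogrid[-disk_radius:disk_radius + 1, -disk_radius:disk_radius + 1]
    disk = x ** 2 + y ** 2 <= disk_radius ** 2
    return ndimage.binary_closing(noisy, structure=disk).astype(np.uint8)
```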

Citations

... Kanungo et al. [64] proposed global and local distortion models to handle perspective distortion, non-linear illumination effects, and perturbations in the optical scanning and digitization of document images. They further improved the models by adopting morphological modeling and considering the perspective geometry of optical systems [65]. ...
Article
Full-text available
The rapid emergence of new portable capturing technologies has significantly increased the number and diversity of document images acquired for business and personal applications. The performance of document image processing systems and applications depends directly on the quality of the document images captured. Therefore, estimating the document's image quality is an essential step in the early stages of the document analysis pipeline. This paper surveys research on Document Image Quality Assessment (DIQA). We first provide a detailed analysis of both subjective and objective DIQA methods. Subjective methods, including ratings and pair-wise comparison-based approaches, are based on human opinions. Objective methods are based on quantitative measurements, including document modeling and human perception-based methods. Second, we summarize the types and sources of document degradations and techniques used to model degradations. In addition, we thoroughly review two standard measures to characterize document image quality: Optical Character Recognition (OCR)-based and objective human perception-based. Finally, we outline open challenges regarding developing DIQA methods and provide insightful discussion and future research directions for this problem. This survey will become an essential resource for the document analysis research community and serve as a basis for future research.
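As a concrete illustration of the OCR-based quality measures mentioned in the survey, one common score is the character accuracy of the OCR output against a ground-truth transcription, computed from the edit distance. The sketch below is a generic implementation of that idea, not code from the survey.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ocr_quality_score(ocr_text: str, ground_truth: str) -> float:
    """Character accuracy in [0, 1]: 1.0 means the OCR output matches exactly."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = levenshtein(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))
```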
... Well-established methods need to be modernised, because the classic mechanisms frequently used in the fight against document forgery have become impractical and ineffective. The growing role of forensic methods in effective investigations creates demand in the labour market for specialised knowledge of the techniques and tactics of effective investigation in this area (KANUNGO et al., 1993). ...
Article
Full-text available
[Purpose] The purpose of the study is to reveal the concept and essence of the forensic prevention of crimes involving the forgery of electronic documents, and to identify problems in using information to build a system for countering such crime and managing documents. [Methodology] The following approaches were used in the work: system-structural, dialectical, and empirical. The forgery of electronic documents and their use is investigated not only within the framework of a single criminal case but also as part of a set of crimes committed depending on the mechanism that is central to the structure of criminal technologies. [Findings] A lack of skills and knowledge about the latest forms of documents, the methods of forging them, and their use in forensic investigations explains this situation. Analysis of investigative and judicial practice shows that cases of forgery of electronic documents are moved to separate proceedings because the person who committed the crime cannot be identified. In some cases, proceedings are returned for additional investigation, since investigators cannot establish the falsification mechanisms and tools and bring appropriate charges. [Practical Implications] The practical significance lies in formulating proposals for improving or amending legislation, thereby improving the effectiveness of the law enforcement agencies involved in countering or combating document forgery.
... This is the case, for example, of the corpus created by Bertrand et al. (2013) to detect falsified characters in documents: after generating documents automatically, the authors shifted characters, changed their size, or tilted them slightly. They also copied some characters to paste them over others and added Kanungo et al. (1993) noise to simulate print-and-scan noise. The datasets consist of 20,000 numeric characters in a single font, of which 5% were fraudulently altered at random, for each type of fraud. Controlling the conditions under which the documents and frauds are created is a real advantage for evaluating an approach and eliminating all kinds of unexpected elements or difficulties tied to the complexity of real data. ...
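For illustration, the kind of character-level tampering described above (shifting, rescaling, and slightly rotating a copied glyph) can be simulated with a few lines of Pillow. The helper below is hypothetical, assumes a grayscale ("L") page image and a known character bounding box, and is not the authors' generation code.

```python
import random
from PIL import Image

def tamper_character(page, char_box, max_shift=2, max_angle=3.0, rng=None):
    """Crop one character, slightly rotate/rescale it, and paste it back with a
    small offset, mimicking the synthetic forgeries described above.
    `page` is a grayscale ("L") PIL image, `char_box` an (x0, y0, x1, y1) box."""
    rng = rng or random.Random()
    doc = page.copy()
    glyph = doc.crop(char_box)

    # Small random rotation and rescaling of the glyph.
    angle = rng.uniform(-max_angle, max_angle)
    scale = rng.uniform(0.95, 1.05)
    w, h = glyph.size
    glyph = glyph.rotate(angle, expand=True, fillcolor=255)
    glyph = glyph.resize((max(1, int(w * scale)), max(1, int(h * scale))))

    # Erase the original glyph (fill with white), then paste the copy slightly off.
    doc.paste(255, char_box)
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    doc.paste(glyph, (char_box[0] + dx, char_box[1] + dy))
    return doc
```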
Thesis
Companies, government agencies, and sometimes private individuals have to deal with numerous frauds on the documents they receive from outside or process internally. Invoices, expense claims, supporting documents... any document used as evidence can be falsified in order to earn more money or to avoid losing it. In France, losses due to fraud are estimated at several billion euros per year. Given the very large flow of documents exchanged, whether digital or paper, having them all checked by fraud-detection experts would be extremely costly in time and money. This is why this thesis proposes a system for the automatic detection of forged documents. While most work on the automatic detection of forged documents focuses on graphical clues, we instead seek to verify the textual information of the document in order to detect inconsistencies or implausibilities. To this end, we first built a corpus of till receipts, which we digitised and from which we extracted the text. After correcting the OCR output and having part of the documents falsified, we extracted their information and modelled it in an ontology, so as to preserve the semantic links between the pieces of information. The information extracted in this way, augmented with its possible disambiguations, can be checked against the other information within the document and against the knowledge base that was built. The semantic links of the ontology also make it possible to look up the information in other knowledge sources, in particular on the Internet.
... Image restoration is the process of recovering the actual geometrical properties of an image based on prior knowledge of the noise phenomena in a degraded image. Creating noise degradation models from the noisy patterns in an image is one of the image restoration techniques [21]. In the proposed method, Gaussian and Rayleigh noise degradation models, together with image averaging and harmonic mean restoration filters, are used to build the noise degradation models. ...
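As a rough illustration of the restoration step mentioned above, the sketch below adds Gaussian and Rayleigh noise to a float image in [0, 1] and applies a harmonic mean filter over a small window. It is a generic textbook formulation, not the implementation used in the cited chapter.

```python
import numpy as np
from scipy import ndimage

def harmonic_mean_filter(img, size=3, eps=1e-6):
    """Harmonic mean restoration: N / sum(1/g) = 1 / mean(1/g) over each
    size x size window; eps avoids division by zero on pure black pixels."""
    g = img.astype(np.float64) + eps
    recip_mean = ndimage.uniform_filter(1.0 / g, size=size)
    return 1.0 / recip_mean

# Example: degrade a synthetic float image in [0, 1] and restore it.
img = np.full((64, 64), 0.9)
rng = np.random.default_rng(0)
noisy_gauss = np.clip(img + rng.normal(0.0, 0.05, img.shape), 0.0, 1.0)
noisy_rayleigh = np.clip(img + rng.rayleigh(0.05, img.shape), 0.0, 1.0)
restored = harmonic_mean_filter(noisy_gauss, size=3)
```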
Chapter
Full-text available
The field of document image processing has seen dramatic growth and increasingly widespread applicability in recent years. Fortunately, advances in computer technology have kept pace with the rapid growth in the volume of image data across applications. One such application of document image processing is OCR (Optical Character Recognition). Preprocessing is one of the prerequisite stages in document image processing, transforming the document into a form suitable for the subsequent stages. In this paper, various preprocessing techniques are proposed for the enhancement of degraded document images. The algorithms implemented are adept at handling a variety of degradations, including the foxing effect, non-uniform illumination, the show-through effect, stain marks, and pen and other scratch marks. The techniques devised work on the basis of noise degradation models generated from the attributes of noisy pixels commonly found in degraded or ancient document images. These noise models are then employed to detect the noisy regions of the image that should undergo enhancement. The enhancement procedures employed include local normalization, convolution using central measures such as the mean and standard deviation, and Sauvola's adaptive binarization technique. The outcomes of the preprocessing procedure are very promising and adaptable to a variety of degraded-document scenarios.
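Sauvola's adaptive binarization, mentioned in the abstract, thresholds each pixel with T = m * (1 + k * (s / R - 1)), where m and s are the local mean and standard deviation over a window. A minimal NumPy/SciPy version is sketched below; the window size and k are illustrative defaults (scikit-image also provides a ready-made threshold_sauvola).

```python
import numpy as np
from scipy import ndimage

def sauvola_binarize(gray, window=25, k=0.2, R=128.0):
    """Sauvola's adaptive threshold: T = m * (1 + k * (s / R - 1)).
    `gray` is an 8-bit grayscale array; the result is a 0/255 image."""
    g = gray.astype(np.float64)
    mean = ndimage.uniform_filter(g, size=window)
    mean_sq = ndimage.uniform_filter(g * g, size=window)
    std = np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))
    threshold = mean * (1.0 + k * (std / R - 1.0))
    return np.where(g > threshold, 255, 0).astype(np.uint8)
```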
... DocCreator provides an implementation of the nonlinear illumination model proposed in [52]. When scanning thick documents, the page to be photocopied may not be flat on the document glass and thus the illumination is not constant on the whole document. ...
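A very rough stand-in for such a non-linear illumination effect is to attenuate pixel intensities progressively toward the binding. The sketch below is only an illustrative approximation, not the model implemented in DocCreator or proposed in [52]; the falloff shape and all parameters are invented for the example.

```python
import numpy as np

def add_spine_shadow(gray, spine_frac=0.85, min_light=0.35, sharpness=80.0):
    """Darken a grayscale page toward its right edge (the binding): the
    illumination factor decays smoothly from 1.0 to `min_light` starting at
    `spine_frac` of the page width. Purely illustrative."""
    h, w = gray.shape
    x = np.linspace(0.0, 1.0, w)
    falloff = 1.0 / (1.0 + np.exp(sharpness * (x - spine_frac)))  # sigmoid decay
    light = min_light + (1.0 - min_light) * falloff
    return (gray.astype(np.float64) * light[None, :]).astype(np.uint8)

# Example on a uniform synthetic page.
page = np.full((200, 150), 220, dtype=np.uint8)
shaded = add_spine_shadow(page)
```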
Article
Full-text available
Most digital libraries that provide user-friendly interfaces, enabling quick and intuitive access to their resources, are based on Document Image Analysis and Recognition (DIAR) methods. Such DIAR methods need ground-truthed document images to be evaluated/compared and, in some cases, trained. Especially with the advent of deep learning-based approaches, the required size of annotated document datasets seems to be ever-growing. Manually annotating real documents has many drawbacks, which often leads to small reliably annotated datasets. In order to circumvent those drawbacks and enable the generation of massive ground-truthed data with high variability, we present DocCreator, a multi-platform and open-source software able to create many synthetic image documents with controlled ground truth. DocCreator has been used in various experiments, showing the interest of using such synthetic images to enrich the training stage of DIAR tools.
... there might be gaps in the staff-line segments. - Kanungo: the degradation model proposed by Kanungo et al. [14] is applied to the whole score. - Rotated: the score is rotated with respect to the x-axis. - Thickness (1): the thickness of the staff lines is uniformly increased. ...
Article
Full-text available
Staff-line removal is an important preprocessing stage for most optical music recognition systems. Common procedures to solve this task involve image processing techniques. In contrast to these traditional methods based on hand-engineered transformations, the problem can also be approached as a classification task in which each pixel is labeled as either staff or symbol, so that only those that belong to symbols are kept in the image. In order to perform this classification, we propose the use of convolutional neural networks, which have demonstrated an outstanding performance in image retrieval tasks. The initial features of each pixel consist of a square patch from the input image centered at that pixel. The proposed network is trained by using a dataset which contains pairs of scores with and without the staff lines. Our results in both binary and grayscale images show that the proposed technique is very accurate, outperforming both other classifiers and the state-of-the-art strategies considered. In addition, several advantages of the presented methodology with respect to traditional procedures proposed so far are discussed.
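A toy version of the patch-based pixel classifier described in the abstract can be written in a few lines of PyTorch: each small grayscale patch centred on a pixel is classified as staff or symbol, and staff pixels are then removed. The layer sizes and patch size below are illustrative, not the architecture reported in the paper.

```python
import torch
import torch.nn as nn

class StaffPixelClassifier(nn.Module):
    """Classify a square grayscale patch centred on a pixel as staff or symbol."""

    def __init__(self, patch_size=28):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        flat = 32 * (patch_size // 4) * (patch_size // 4)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 64), nn.ReLU(),
            nn.Linear(64, 2),                 # staff vs. symbol logits
        )

    def forward(self, patches):              # patches: (N, 1, patch_size, patch_size)
        return self.classifier(self.features(patches))

# Keeping only symbol pixels amounts to zeroing out pixels predicted as staff.
model = StaffPixelClassifier()
logits = model(torch.randn(8, 1, 28, 28))    # 8 example patches
is_symbol = logits.argmax(dim=1) == 1
```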
... Several models have been proposed since the 90s for the noise it introduces. One of the earliest works for binary images is that of Kanungo et al. [KHP93]. The most ...
Thesis
Countless documents are printed, scanned, faxed, and photographed every day. These documents are hybrid: they exist as both hard copies and digital copies. Moreover, their digital copies can be viewed and modified simultaneously in many places. With the availability of image modification software, it has become very easy to modify or forge a document. This creates a rising need for an authentication scheme capable of handling these hybrid documents. Current solutions rely on separate authentication schemes for paper and digital documents. Other solutions rely on manual visual verification and offer only partial security, or require that sensitive documents be stored outside the company's premises and that network access be available at verification time. In order to overcome all these issues, we propose to create a semantic hashing algorithm for document images. This hashing algorithm should provide a compact digest of all the visually significant information contained in the document. This digest will allow current hybrid security systems to secure the entire document. This can be achieved thanks to document analysis algorithms. However, those need to be brought to an unprecedented level of performance, in particular in terms of reliability, which depends on their stability. After defining the context of this study and what a stable algorithm is, we focused on producing stable algorithms for layout description, document segmentation, character recognition, and description of the graphical parts of a document.
... However, nowadays a document, the so-called hybrid document, is often used in either electronic or paper form according to need. The hybrid document therefore undergoes a life cycle of printing and scanning, so different degraded versions of the document exist, because the printing and scanning process introduces specific degradations, such as print-and-scan noise, into the document [1]. Thus, the concept of an electronic signature cannot be applied. ...
... The hybrid document therefore undergoes a life cycle of printing and scanning, so different degraded versions of the document exist, because the printing and scanning process introduces specific degradations into the document, such as print-and-scan noise [13]. Thus, the concept of a digital hash cannot be applied. ...
Conference Paper
Full-text available
Security applications related to document authentication require an exact match between an authentic copy and the original of a document. This implies that the document analysis algorithms used to compare two documents (original and copy) should provide the same output. This kind of algorithm includes the computation of layout descriptors from the segmentation result, as the layout of a document is part of its semantic content. To this end, this paper presents a new layout descriptor that significantly improves on the state of the art. The basis of this descriptor is a Delaunay triangulation of the centroids of the document regions. This triangulation is seen as a graph, and the adjacency matrix of the graph forms the descriptor. While most layout descriptors have a stability of 0% with regard to an exact match, our descriptor has a stability of 74%, which can be brought up to 100% with the use of an appropriate matching algorithm. It also achieves 100% accuracy and retrieval in a document retrieval scheme on a database of 960 document images. Furthermore, this descriptor is extremely efficient, as it performs a search in constant time with respect to the size of the document database and reduces the size of the database index by a factor of 400.
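The core idea of the descriptor (a Delaunay triangulation of region centroids, read off as an adjacency matrix) can be sketched with SciPy as follows; this omits the matching algorithm mentioned in the abstract and is not the authors' implementation.

```python
import numpy as np
from scipy.spatial import Delaunay

def layout_descriptor(centroids):
    """Triangulate the region centroids and return the adjacency matrix of the
    resulting graph. `centroids` is an (N, 2) array of region centre coordinates
    (N >= 3, not all collinear)."""
    centroids = np.asarray(centroids, dtype=np.float64)
    tri = Delaunay(centroids)
    n = len(centroids)
    adjacency = np.zeros((n, n), dtype=np.uint8)
    for a, b, c in tri.simplices:            # each triangle links three regions
        adjacency[a, b] = adjacency[b, a] = 1
        adjacency[b, c] = adjacency[c, b] = 1
        adjacency[a, c] = adjacency[c, a] = 1
    return adjacency

# Example with four region centroids.
descriptor = layout_descriptor([(10, 12), (50, 14), (30, 80), (70, 75)])
```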
... However, nowadays a document, the so-called hybrid document, is often used in either electronic or paper form according to need. The hybrid document therefore undergoes a life cycle of printing and scanning, so different degraded versions of the document exist, because the printing and scanning process introduces specific degradations, such as print-and-scan noise, into the document [1]. Thus, the concept of an electronic signature cannot be applied. ...
Conference Paper
Full-text available
Current security applications rely on the performance of the algorithms they use. For document authentication, document analysis algorithms should be precise enough to detect any modification. They should also be stable enough that a document and its photocopy yield the same result. This requirement is absolute stability: having close values is not enough; they need to be exactly the same. This paper presents our preliminary work on the case of a stable layout descriptor. While everyone knows that thresholds are a source of instability, they are still common practice. We describe a promising layout descriptor which drastically reduces the number of thresholds compared to the state of the art. Unfortunately, it is not stable enough when tested on real data: there are still too many thresholds. This paper opens and justifies the path towards algorithms without any threshold.