
Issues in Ground-Truthing Graphic Documents

Authors: Daniel Lopresti, George Nagy

Abstract

We examine the nature of ground-truth: whether it is always well-defined for a given task, or only relative and approximate. In the conventional scenario, reference data is produced by recording the interpretation of each test document using a chosen data-entry platform. Looking a little more closely at this process, we study its constituents and their interrelations. We provide examples from the literature and from our own experiments where non-trivial problems with each of the components appear to preclude the possibility of real progress in evaluating automated graphics recognition systems, and propose possible solutions. More specifically, for documents with complex structure we recommend multi-valued, layered, weighted, functional ground-truth supported by model-guided reference data-entry systems and protocols. Mostly, however, we raise far more questions than we currently have answers for.
ISSUES IN GROUND-TRUTHING GRAPHIC DOCUMENTS
Daniel Lopresti, Bell Labs, Lucent Technologies
George Nagy, Rensselaer Polytechnic Institute
And diff'ring judgements serve but to declare,
That truth lies somewhere, if we knew but where.
- William Cowper (1731-1800)
Is ground truth indeed fixed, unique and static? Or is it, like beauty, in
the eyes of the beholder, relative and approximate? In OCR test datasets,
from Highleyman’s hand-digitized numerals on punched cards to the
U-W, ISRI, NIST, and CEDAR CD-ROMs, the first point of view held sway.
But in our recent experiments on printed tables, we have been forced to
the second. This issue may arise more and more often as researchers
attempt to evaluate recognition systems for increasingly complex graphic
documents.
Strict validation with respect to reference data (i.e., ground truth) seems
appropriate for pattern recognition systems designed for real applications
where an appropriate set of samples is available. (The choice of sampling
strategy for “real applications” is itself a recondite topic that we skirt
here.) We examine the major components that seem to play a part in
determining the nature of reference data. In the conventional scenario,
the reference data is produced by recording an interpretation of each
document using a chosen data-entry platform. Looking a little more
closely at this process, we study its constituents and their interrelations:
Input format. The input represents the data provided to both
interpreters and the recognition system. It is often a pixel array
obtained by optically scanning selected documents or parts thereof. It
could also be in a specialized format: ASCII for text and tables
collected from email, RTF or LaTeX for partially processed, mainly-text
documents,
chain codes or medial axis transforms for line drawings.
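
To make one of these specialized formats concrete, here is a minimal
Python sketch (not part of the original paper) of a Freeman-style
8-direction chain code for a pixel path; the direction table and the
example path are invented for illustration.

    # Minimal sketch: Freeman 8-direction chain code for a pixel path.
    # The direction table and the example path are invented, not taken
    # from any of the datasets mentioned in the text.
    DIRECTIONS = {  # (dx, dy) -> Freeman code, x to the right, y upward
        (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
        (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
    }

    def chain_code(path):
        """Encode a list of 8-connected (x, y) pixel coordinates as Freeman codes."""
        return [DIRECTIONS[(x1 - x0, y1 - y0)]
                for (x0, y0), (x1, y1) in zip(path, path[1:])]

    # A short horizontal-then-diagonal stroke:
    print(chain_code([(0, 0), (1, 0), (2, 0), (3, 1), (4, 2)]))  # [0, 0, 1, 1]
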
Model. The model is a formal specification of the appearance and
content of a document. A particular interpretation of the document
is an instance of the model, as is the output of a recognition
system. What do we do if the correct interpretation cannot be
accommodated by the chosen model?
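
As a purely illustrative sketch of what "instance of the model" can mean
in practice, the Python fragment below defines a deliberately simplistic
table model, a strict grid with exactly one cell per row/column slot, and
checks whether an entered interpretation conforms to it. The model, class
names, and example cells are hypothetical; they are not the model used in
our table experiments.

    # Minimal sketch of a document "model" and a conformance check.
    # The model here (a strict grid with exactly one cell per slot) is
    # deliberately simplistic and hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Cell:
        row: int
        col: int
        text: str

    def is_model_instance(cells, n_rows, n_cols):
        """An interpretation conforms if every (row, col) slot inside the
        declared grid is filled exactly once."""
        seen = set()
        for c in cells:
            if not (0 <= c.row < n_rows and 0 <= c.col < n_cols):
                return False                 # cell outside the declared grid
            if (c.row, c.col) in seen:
                return False                 # two cells claim the same slot
            seen.add((c.row, c.col))
        return len(seen) == n_rows * n_cols  # no empty slots

    cells = [Cell(0, 0, "Qty"), Cell(0, 1, "Price"),
             Cell(1, 0, "3"), Cell(1, 1, "4.50")]
    print(is_model_instance(cells, 2, 2))    # True

A correct reading that requires, say, a header cell spanning two columns
simply has no representation in this toy model, which is exactly the
predicament the question above points to.
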
Reference Entry System. This could be as simple as a standard text
editor like vi or Notepad. For graphic documents, some 2-D
interaction is required. DAFS-1.0 (Illuminator), entity graphs, X-Y
trees, and rectangular zone definition systems have been used for
text layout. We used Daffy for tables. Questions that we will
examine in greater detail are the conformance of the Reference
Entry System to the Model (is it possible to enter reference data
that does not correspond to any possible model instance?), and its bias
(does it favor some model instances over others?). To avoid
discrepancies, should we expeditiously redefine the Model as the
set of representations that can be produced by the reference entry
system?
Verification and Reconciliation. Because the reference data should
be more accurate than the recognition system being evaluated, the
reference data is usually entered more than once, preferably by
different operators, or even by trusted automated systems. Where
there is a high degree of consensus, the results of multiple passes
are reconciled. However, in more difficult tasks it may be desirable
to retain several versions of the truth. We may also accept partial
reference information. For instance, we may be satisfied with line-
end coordinates of a drawing even if the reference entry system
allows differentiating between leaders, dimension lines, and part
boundaries. Isn’t all reference data incomplete to a greater or
lesser extent?
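
The following Python sketch (ours, hypothetical, and not a description of
any existing reconciliation tool) illustrates one way to reconcile
multiple ground-truthing passes while retaining several versions of the
truth where consensus is weak; the item identifiers, labels, and the 75%
consensus threshold are arbitrary choices.

    # Minimal sketch of reconciling several ground-truthing passes.
    # Item identifiers, labels, and the 0.75 consensus threshold are
    # arbitrary, invented values.
    from collections import Counter

    def reconcile(passes, threshold=0.75):
        """passes: one dict per operator, mapping item id -> label.
        Returns item id -> a single consensus label, or a sorted list of
        all labels seen when consensus falls below the threshold."""
        truth = {}
        for item in set().union(*passes):
            votes = Counter(p[item] for p in passes if item in p)
            label, count = votes.most_common(1)[0]
            if count / sum(votes.values()) >= threshold:
                truth[item] = label            # strong consensus: one truth
            else:
                truth[item] = sorted(votes)    # weak consensus: keep all versions
        return truth

    passes = [{"cell_3": "header", "cell_4": "data"},
              {"cell_3": "header", "cell_4": "data"},
              {"cell_3": "data",   "cell_4": "data"}]
    print(reconcile(passes))
    # cell_4 -> 'data' (unanimous); cell_3 -> ['data', 'header'] (2:1 split retained)
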
Truth format. The output of the previous stage must clearly have
more information than the input. To facilitate comparison, ideally
the ground-truth format is identical, or at least equivalent, to that
of the output of the recognition system.
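
As a small, hypothetical illustration of format equivalence, the sketch
below maps a record from an imagined reference-entry tool onto the field
names an imagined recognizer emits, so that a single comparison routine
can be applied to both; neither record layout corresponds to a real
system.

    # Minimal sketch: mapping a record from a hypothetical reference-entry
    # tool onto the field names a hypothetical recognizer emits.
    def normalize_reference(entry_record):
        """Convert an entry-tool record into the recognizer's output format."""
        return {
            "bbox": (entry_record["x"], entry_record["y"],
                     entry_record["x"] + entry_record["w"],
                     entry_record["y"] + entry_record["h"]),
            "label": entry_record["region_type"],
            "text": entry_record.get("content", ""),
        }

    print(normalize_reference({"x": 10, "y": 20, "w": 100, "h": 15,
                               "region_type": "cell", "content": "Total"}))
    # {'bbox': (10, 20, 110, 35), 'label': 'cell', 'text': 'Total'}
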
Personnel. For printed OCR, ordinary literacy is usually considered
sufficient. For handwriting, literacy in the language of the
document may be necessary. For more complicated tasks, some
domain expertise is desirable. We will discuss the effects of subject
matter expertise versus specialized training (as, for instance, for
remote postal address entry, medical forms, or archival engineering
drawing conversion). How much training should be focused on the
model and the reference entry system versus the topical domain?
Is consistency more important than accuracy? Can training itself
introduce a bias that favors one recognition system over another?
Although each of these constituents plays a significant role in most
reported graphic recognition experiments, they are seldom described
explicitly. Perhaps there is something to be gained by spotlighting them
in a situation where they don’t play a subordinate role to new models,
algorithms, data sets, or verification procedures.
As mentioned, we are interested in scenarios where the evaluation of an
automated system is a quantitative measure based on automated
comparison of the output files of the recognition system with a set of
reference (“truth”) files produced by (human) interpreters from the same
input. This is by no means the only possible scenario for evaluation.
Several other methods, including the following, have merit.
1. The interpreter uses a different input than the recognition
system (for example, hardcopy instead of digitized images).
2. The patterns to be recognized are produced from the truth files,
as in the case of bitmaps of CAD files in GREC dashed-line or
circular curve recognition contests. Do we lose something by
accepting the source files as the unequivocal truth even if the
output lends itself to plausible alternative interpretations?
3. The comparison is not automated: the output of the recognition
system may be evaluated by inspection, or corrected by an
operator in a timed session (a form of goal directed evaluation).
4. There is no reference data or ground-truth: in postal address
reading, the number of undeliverable letters may be a good
measure of the address readers’ performance.
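
Returning to the conventional scenario, the sketch below shows one
plausible way to score a recognizer against multi-valued ground truth:
the output is compared with every retained reference version and credited
with the best agreement. The items, labels, and the simple agreement
measure are invented for illustration; they are not a prescribed metric.

    # Minimal sketch of scoring against multi-valued ground truth.
    # Items, labels, and the agreement measure are invented examples.
    def agreement(output, reference):
        """Fraction of output items whose label matches this reference version."""
        hits = sum(1 for item, label in output.items() if reference.get(item) == label)
        return hits / len(output)

    def score(output, references):
        """Best agreement over all accepted versions of the truth."""
        return max(agreement(output, ref) for ref in references)

    output = {"cell_3": "header", "cell_4": "data"}
    references = [{"cell_3": "header", "cell_4": "data"},
                  {"cell_3": "data",   "cell_4": "data"}]
    print(score(output, references))   # 1.0
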
The notion of ambiguity is, of course, not unique to pattern recognition.
Every linguist, soothsayer and politician has a repertory of ambiguous
statements. Perceptual psychologists delight in figure-ground illusions
that change meaning in the blink of an eye. Motivation is notoriously
ambiguous: “Is the Mona Lisa smiling?” We do not propose to investigate
ambiguity for its own sake, but only insofar as it affects the practical
aspects of evaluating symbol-based document analysis systems.
We provide examples from the literature and from our own experiments
of non-trivial problems with each of the six major constituents of ground
truth. Unless and until they are addressed, these problems appear to
preclude the possibility of real progress in evaluating automated graphic
recognition systems. For some of them, we can propose potential
solutions.