
Issues in Ground-Truthing Graphic Documents

Authors: Daniel Lopresti, George Nagy

Abstract

We examine the nature of ground-truth: whether it is always well-defined for a given task, or only relative and approximate. In the conventional scenario, reference data is produced by recording the interpretation of each test document using a chosen data-entry platform. Looking a little more closely at this process, we study its constituents and their interrelations. We provide examples from the literature and from our own experiments where non-trivial problems with each of the components appear to preclude the possibility of real progress in evaluating automated graphics recognition systems, and propose possible solutions. More specifically, for documents with complex structure we recommend multi-valued, layered, weighted, functional ground-truth supported by model-guided reference data-entry systems and protocols. Mostly, however, we raise far more questions than we currently have answers for.
ISSUES IN GROUND-TRUTHING GRAPHIC DOCUMENTS
Daniel Lopresti, Bell Labs, Lucent Technologies
George Nagy, Rensselaer Polytechnic Institute
And diff'ring judgements serve but to declare,
That truth lies somewhere, if we knew but where.
- William Cowper (1731-1800)
Is ground truth indeed fixed, unique and static? Or is it, like beauty, in
the eyes of the beholder, relative and approximate? In OCR test datasets,
from Highleyman’s hand-digitized numerals on punched cards to the
U-W, ISRI, NIST, and CEDAR CD-ROMs, the first point of view held sway.
But in our recent experiments on printed tables, we have been forced to
the second. This issue may arise more and more often as researchers
attempt to evaluate recognition systems for increasingly complex graphic
documents.
Strict validation with respect to reference data (i.e., ground truth) seems
appropriate for pattern recognition systems designed for real applications
where an appropriate set of samples is available. (The choice of sampling
strategy for “real applications” is itself a recondite topic that we skirt
here.) We examine the major components that seem to play a part in
determining the nature of reference data. In the conventional scenario,
the reference data is produced by recording an interpretation of each
document using a chosen data-entry platform. Looking a little more
closely at this process, we study its constituents and their interrelations:
Input format. The input represents the data provided to both
interpreters and the recognition system. It is often a pixel array
obtained by optically scanning selected documents or parts thereof. It
could also be in a specialized format: ASCII for text and tables
collected from email, RTF or LaTeX for partially processed, mainly-text
documents,
chain codes or medial axis transforms for line drawings.
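
To make one of these specialized formats concrete, here is a minimal
Python sketch (not part of the original paper) of a Freeman-style
8-direction chain code for a pixel path; the direction table and the
example path are invented for illustration.

    # Minimal sketch: Freeman 8-direction chain code for a pixel path.
    # The direction table and the example path are invented, not taken
    # from any of the datasets mentioned in the text.
    DIRECTIONS = {  # (dx, dy) -> Freeman code, x to the right, y upward
        (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
        (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
    }

    def chain_code(path):
        """Encode a list of 8-connected (x, y) pixel coordinates as Freeman codes."""
        return [DIRECTIONS[(x1 - x0, y1 - y0)]
                for (x0, y0), (x1, y1) in zip(path, path[1:])]

    # A short horizontal-then-diagonal stroke:
    print(chain_code([(0, 0), (1, 0), (2, 0), (3, 1), (4, 2)]))  # [0, 0, 1, 1]
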
Model. The model is a formal specification of the appearance and
content of a document. A particular interpretation of the document
is an instance of the model, as is the output of a recognition
system. What do we do if the correct interpretation cannot be
accommodated by the chosen model?
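
As a purely illustrative sketch of what "instance of the model" can mean
in practice, the Python fragment below defines a deliberately simplistic
table model, a strict grid with exactly one cell per row/column slot, and
checks whether an entered interpretation conforms to it. The model, class
names, and example cells are hypothetical; they are not the model used in
our table experiments.

    # Minimal sketch of a document "model" and a conformance check.
    # The model here (a strict grid with exactly one cell per slot) is
    # deliberately simplistic and hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Cell:
        row: int
        col: int
        text: str

    def is_model_instance(cells, n_rows, n_cols):
        """An interpretation conforms if every (row, col) slot inside the
        declared grid is filled exactly once."""
        seen = set()
        for c in cells:
            if not (0 <= c.row < n_rows and 0 <= c.col < n_cols):
                return False                 # cell outside the declared grid
            if (c.row, c.col) in seen:
                return False                 # two cells claim the same slot
            seen.add((c.row, c.col))
        return len(seen) == n_rows * n_cols  # no empty slots

    cells = [Cell(0, 0, "Qty"), Cell(0, 1, "Price"),
             Cell(1, 0, "3"), Cell(1, 1, "4.50")]
    print(is_model_instance(cells, 2, 2))    # True

A correct reading that requires, say, a header cell spanning two columns
simply has no representation in this toy model, which is exactly the
predicament the question above points to.
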
Reference Entry System. This could be as simple as a standard text
editor like vi or Notepad. For graphic documents, some 2-D
interaction is required. DAFS-1.0 (Illuminator), entity graphs, X-Y
trees, and rectangular zone definition systems have been used for
text layout. We used Daffy for tables. Questions that we will
examine in greater detail are the conformance of the Reference
Entry System to the Model (is it possible to enter reference data
that does not correspond to any possible model instance?), and its bias
(does it favor some model instances over others?). To avoid
discrepancies, should we expeditiously redefine the Model as the
set of representations that can be produced by the reference entry
system?
Verification and Reconciliation. Because the reference data should
be more accurate than the recognition system being evaluated, the
reference data is usually entered more than once, preferably by
different operators, or even by trusted automated systems. Where
there is a high degree of consensus, the results of multiple passes
are reconciled. However, in more difficult tasks it may be desirable
to retain several versions of the truth. We may also accept partial
reference information. For instance, we may be satisfied with line-
end coordinates of a drawing even if the reference entry system
allows differentiating between leaders, dimension lines, and part
boundaries. Isn’t all reference data incomplete to a greater or
lesser extent?
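
The following Python sketch (ours, hypothetical, and not a description of
any existing reconciliation tool) illustrates one way to reconcile
multiple ground-truthing passes while retaining several versions of the
truth where consensus is weak; the item identifiers, labels, and the 75%
consensus threshold are arbitrary choices.

    # Minimal sketch of reconciling several ground-truthing passes.
    # Item identifiers, labels, and the 0.75 consensus threshold are
    # arbitrary, invented values.
    from collections import Counter

    def reconcile(passes, threshold=0.75):
        """passes: one dict per operator, mapping item id -> label.
        Returns item id -> a single consensus label, or a sorted list of
        all labels seen when consensus falls below the threshold."""
        truth = {}
        for item in set().union(*passes):
            votes = Counter(p[item] for p in passes if item in p)
            label, count = votes.most_common(1)[0]
            if count / sum(votes.values()) >= threshold:
                truth[item] = label            # strong consensus: one truth
            else:
                truth[item] = sorted(votes)    # weak consensus: keep all versions
        return truth

    passes = [{"cell_3": "header", "cell_4": "data"},
              {"cell_3": "header", "cell_4": "data"},
              {"cell_3": "data",   "cell_4": "data"}]
    print(reconcile(passes))
    # cell_4 -> 'data' (unanimous); cell_3 -> ['data', 'header'] (2:1 split retained)
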
Truth format. The output of the previous stage must clearly have
more information than the input. To facilitate comparison, ideally
the ground-truth format is identical, or at least equivalent, to that
of the output of the recognition system.
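
As a small, hypothetical illustration of format equivalence, the sketch
below maps a record from an imagined reference-entry tool onto the field
names an imagined recognizer emits, so that a single comparison routine
can be applied to both; neither record layout corresponds to a real
system.

    # Minimal sketch: mapping a record from a hypothetical reference-entry
    # tool onto the field names a hypothetical recognizer emits.
    def normalize_reference(entry_record):
        """Convert an entry-tool record into the recognizer's output format."""
        return {
            "bbox": (entry_record["x"], entry_record["y"],
                     entry_record["x"] + entry_record["w"],
                     entry_record["y"] + entry_record["h"]),
            "label": entry_record["region_type"],
            "text": entry_record.get("content", ""),
        }

    print(normalize_reference({"x": 10, "y": 20, "w": 100, "h": 15,
                               "region_type": "cell", "content": "Total"}))
    # {'bbox': (10, 20, 110, 35), 'label': 'cell', 'text': 'Total'}
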
Personnel. For printed OCR, ordinary literacy is usually considered
sufficient. For handwriting, literacy in the language of the
document may be necessary. For more complicated tasks, some
domain expertise is desirable. We will discuss the effects of subject
matter expertise versus specialized training (as, for instance, for
remote postal address entry, medical forms, or archival engineering
drawing conversion). How much training should be focused on the
model and the reference entry system versus the topical domain?
Is consistency more important than accuracy? Can training itself
introduce a bias that favors one recognition system over another?
Although each of these constituents plays a significant role in most
reported graphic recognition experiments, they are seldom described
explicitly. Perhaps there is something to be gained by spotlighting them
in a situation where they don’t play a subordinate role to new models,
algorithms, data sets, or verification procedures.
As mentioned, we are interested in scenarios where the evaluation of an
automated system is a quantitative measure based on automated
comparison of the output files of the recognition system with a set of
reference (“truth”) files produced by (human) interpreters from the same
input. This is by no means the only possible scenario for evaluation.
Several other methods, including the following, have merit.
1. The interpreter uses a different input than the recognition
system (for example, hardcopy instead of digitized images).
2. The patterns to be recognized are produced from the truth files,
as in the case of bitmaps of CAD files in GREC dashed-line or
circular curve recognition contests. Do we lose something by
accepting the source files as the unequivocal truth even if the
output lends itself to plausible alternative interpretations?
3. The comparison is not automated: the output of the recognition
system may be evaluated by inspection, or corrected by an
operator in a timed session (a form of goal directed evaluation).
4. There is no reference data or ground-truth: in postal address
reading, the number of undeliverable letters may be a good
measure of the address readers’ performance.
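
Returning to the conventional scenario, the sketch below shows one
plausible way to score a recognizer against multi-valued ground truth:
the output is compared with every retained reference version and credited
with the best agreement. The items, labels, and the simple agreement
measure are invented for illustration; they are not a prescribed metric.

    # Minimal sketch of scoring against multi-valued ground truth.
    # Items, labels, and the agreement measure are invented examples.
    def agreement(output, reference):
        """Fraction of output items whose label matches this reference version."""
        hits = sum(1 for item, label in output.items() if reference.get(item) == label)
        return hits / len(output)

    def score(output, references):
        """Best agreement over all accepted versions of the truth."""
        return max(agreement(output, ref) for ref in references)

    output = {"cell_3": "header", "cell_4": "data"}
    references = [{"cell_3": "header", "cell_4": "data"},
                  {"cell_3": "data",   "cell_4": "data"}]
    print(score(output, references))   # 1.0
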
The notion of ambiguity is, of course, not unique to pattern recognition.
Every linguist, soothsayer and politician has a repertory of ambiguous
statements. Perceptual psychologists delight in figure-ground illusions
that change meaning in the blink of an eye. Motivation is notoriously
ambiguous: “Is the Mona Lisa smiling?” We do not propose to investigate
ambiguity for its own sake, but only insofar as it affects the practical
aspects of evaluating symbol-based document analysis systems.
We provide examples from the literature and from our own experiments
of non-trivial problems with each of the six major constituents of ground
truth. Unless and until they are addressed, these problems appear to
preclude the possibility of real progress in evaluating automated graphic
recognition systems. For some of them, we can propose potential
solutions.