Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier
An Overview of Data Extraction from Invoices
THOMAS SAOUT, FRÉDÉRIC LARDEUX, and FRÉDÉRIC SAUBION
Univ Angers, LERIA, SFR MATHSTIC, F-49000 Angers, France (e-mail: firstname.lastname@univ-angers.fr)
Corresponding author: Thomas Saout (e-mail: thomas.saout@etud.univ-angers.fr).
This work is supported by the KS2 company.
ABSTRACT This paper provides a comprehensive overview of the process for information retrieval
from invoices. Invoices serve as proof of purchase and contain important information, including the
date, description, quantity, and the price of goods or services, as well as the terms of payment.
Companies must process invoices quickly and accurately to maintain proper financial records. To
automate this workflow, commercial systems have been developed. Despite the complexity involved,
realizing automated processing of invoices necessitates the harmonious integration of a wide range of
techniques and methods. While several surveys have shed light on different aspects of this workflow,
our objective in this paper is to present a synthetic view of the process and emphasize the most
pertinent challenges. We discuss the digitalization of invoices and the use of natural language
processing techniques to extract relevant information. We also review machine learning and deep
learning techniques that are widely used to handle the variability of layouts, minimize end-user
tasks, and train and adapt to new contexts. The purpose of this overview is not to evaluate various
systems and algorithms, but rather to propose a survey that reviews a wide scope of techniques for
different data extraction tasks, addressing both information extraction and structure recognition
for invoice processing. Specifically, we focus on table processing, paying particular attention to
graph-based approaches.
INDEX TERMS Invoice Processing - Table Recognition - Information Extraction
I. INTRODUCTION
Invoices are crucial documents for companies as they
serve as proof of purchase and are necessary for account-
ing and tax purposes. They are created by the seller
and sent to the buyer to request payment for goods or
services. Invoices typically contain essential information
such as the purchase date, the description of goods
or services, the quantity and price, and the payment
terms. Companies need to process invoices promptly
and accurately to maintain proper financial records and
avoid potential payment delays. Digitizing invoices can
help streamline the process and reduce the risk of errors.
Paper invoices can be converted into a digital format,
and automated systems can extract critical information
like invoice numbers, amounts, and dates. This approach
can speed up processing time and improve accuracy.
Furthermore, digital invoices can be easily stored and
accessed through document management systems, mak-
ing it simpler to keep track of them and retrieve them
when needed.
Automated invoice processing requires handling several document characteristics, such as varying formats and layouts of invoices, differences in language and terminology, and errors or inaccuracies in the data [31]. This presents challenges, but with advanced techniques such as machine learning and deep learning, the process can be automated and made more accurate so as to accomplish the following objectives:
- effectively handle the variability of layouts: due
to the lack of a global standard, invoices often
exhibit significantly different formatting. Naturally,
the required legal information varies from country
to country, and furthermore, it can be arranged
in various ways within the document. Hence, it is
crucial to have labeling and typing techniques in
place to isolate the key elements of an invoice.
- train and rapidly adapt to new contexts: in a practical scenario, companies often lack a substantial
corpus of invoices that are properly labeled for
learning or testing purposes. However, for small
companies, the invoices they handle are typically
specific since they originate from a relatively limited
number of customers and suppliers. Consequently, it
should be feasible to customize a system effortlessly
VOLUME 10, 2022
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3360528
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
for a particular situation,
- minimize the end-user task: while some systems rely
on predefined invoice presentation styles, modify-
ing these layouts typically requires extensive user
interaction. Although it is important to engage the
user in formulating their needs and specifying the
desired information and management rules, it is
essential to minimize the laborious manual tasks
involved in system tuning,
- efficiently detect and extract tables from the invoices: tables play a crucial role in invoices, as they are primarily used to present accounting information. However, their formatting can vary significantly, and in some cases they may only be suggested, without explicit graphic delimiters. Consequently, detecting tables within invoices is a significant challenge for automated systems, which must exploit the distinctive characteristics of invoices rather than the headings or predefined elements that more general documents rely on.
Automated processing of documents requires dedi-
cated approaches based on the targeted domain. For in-
stance, legal texts require specific techniques [17], [42].
The analysis of administrative documents, including
invoices, has been an active area of research for many
years [13]. The task is complex because invoices can
come in various formats and contain a wide range of
information such as invoice numbers, amounts, dates,
and payment terms [31]. The lack of structure in
documents poses a real challenge for companies [12].
To address this complexity, various techniques have
been developed, such as Optical Character Recognition
(OCR) [127] for digitizing paper invoices and natural
language processing (NLP) techniques for extracting
relevant information from the text. Neural networks are
also frequently used for document classification tasks
[138].
Commercial systems have been developed by companies like ITESOFT¹ and ABBYY² [123] to automate
the processing of invoices. These systems use a combi-
nation of OCR, NLP, and machine learning techniques
to extract information from invoices and process them
automatically. By integrating with the company’s exist-
ing systems, such as accounting and enterprise resource
planning (ERP) systems, these systems streamline the
invoice processing workflow into a global electronic
document management system (EDMS) [63]. Recent
advances have led to the development of other end-to-
end solutions for invoices [6].
Processing invoices requires complex administrative
procedures and involves different departments such as
accounting, logistics, and supply chain. To ensure effi-
ciency and accuracy, specific workflows are often used
¹ https://www.itesoft.com
² https://www.abbyy.com
[56]. These workflows typically involve multiple steps,
such as document digitization, information extraction,
and data validation, as well as security considerations
[97]. Since invoices can take on various forms, statistical
learning methods have been used to detect their possible
classes [129].
The step of digitizing documents involves utilizing OCR technology to convert paper invoices into a digital format, allowing them to be processed and stored electronically with ease. Next comes the information extraction phase, which entails identifying key fields such as types, amounts, dates, and other crucial details on the invoices. To achieve this, natural language processing (NLP) techniques, such as named entity recognition (NER), are typically employed to recognize and extract specific information from the text [50], [52].
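As a toy illustration of such rule-based field identification, the following sketch labels a few key fields in raw OCR text with regular expressions; the field names and patterns are illustrative assumptions, not taken from any system surveyed here.

```python
import re

# Hedged sketch: simple rule-based labeling of key invoice fields on raw OCR
# text. The patterns and field names below are illustrative assumptions.
PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|#)\s*[:\-]?\s*(\w+)", re.I),
    "date": re.compile(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b"),
    "total": re.compile(r"Total\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.I),
}

def extract_fields(text: str) -> dict:
    """Return the first match for each known field, or None if absent."""
    out = {}
    for field, pattern in PATTERNS.items():
        m = pattern.search(text)
        out[field] = m.group(1) if m else None
    return out
```

For example, `extract_fields("Invoice No: A123 Date 12/03/2023 Total: $1,250.00")` labels the three fields; systems surveyed later in this paper replace such brittle patterns with learned models.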
Although outside the scope of this overview, it is worth noting that classification techniques have been proposed for managing sets of invoices and categorizing financial transactions based on their economic nature [9], [132].
Machine learning can also be used to forecast financial
data [55] related to invoicing, and time series tools such
as [141]–[143] are particularly useful for this purpose.
There have been many proposed solutions for man-
aging information contained in scanned invoices, and
most of these solutions are based on machine learning
techniques, which have seen recent advances [50], [102].
In general, probabilistic and statistical approaches seem
to be a natural way of understanding documents [88].
The first challenge in this field was identifying invoices
from a set of documents [71], and models have been
proposed to streamline this process [22].
Once invoices have been correctly scanned and identi-
fied, the next challenge is to extract relevant information
from them. Labeling techniques can be applied using
rules [33], but recent research has focused on using neu-
ral networks (NN) for named entity recognition (NER)
tasks [73], [75]. Invoices often contain text sequences that are vastly different from natural language, and dedicated information extraction methods have been proposed to account for the particular structures of these documents. For example, [31] uses a star graph to model the neighborhood of a text token, allowing the context of a token to be taken into account when extracting information. This is a powerful method, as it allows meaningful information to be extracted from the document.
Several surveys provide an overview of general pro-
cessing techniques for image documents, such as OCR
techniques [59], text detection techniques [16], [147],
NER approaches [75], [95], [145], and table processing
[30], [38], [64]. However, few papers provide general
considerations for invoice processing. One such paper
is [52], which does not cover table extraction. In
[6], a very interesting end-to-end system is proposed
for processing invoices, covering the different steps mentioned above. Specific techniques are selected, and the resulting system focuses on key field extraction. From these considerations, our motivation is to offer a more comprehensive overview of available methods that practitioners can use to design end-to-end solutions for invoice processing. Note that we also pay particular attention to recent approaches based on graph representations. Let us also mention that our study is rooted in practical experience, underpinned by the effective implementation of an electronic document management system in partnership with a company.
This overview aims to examine data extraction in the
context of automated invoice processing. In Section II,
we provide a comprehensive description of an invoice to
highlight the critical data and structures that require
attention. In our main section, Section III, we discuss
the different components necessary for extracting this
data. These include the digitization of the invoice using
OCR (Section III-A), the development of a data extrac-
tion process (Section III-B), which involves recognizing
specific entities (Section III-C) and identifying tables
(Section III-D). Section III-E explores how geographical
information can be utilized, with a particular focus on
the use of graph-based representations.
Since such a survey involves numerous references, we provide an appendix with bibliographic tables that help the reader quickly identify the cited references according to the organization of the sections described above.
II. INVOICE MODELING
Defining a suitable representation of an invoice is an important step toward clearly understanding its specifications. An invoice processing application must capture data such as the invoice number, date, and amounts, and may assign the invoice to a specific customer or project. In [22], a semantic network was used to describe the invoice domain at different levels of abstraction. Before moving on to invoice processing techniques, we propose here a model that focuses on the relevant extraction tasks that an invoice processing application is expected to handle.
We chose to initially limit the scope of invoice ex-
traction. Figure 1 illustrates a basic sample of an
invoice, emphasizing key information sought by auto-
mated document processing tools. The extraction of
specific fields, such as the invoice date (highlighted in
the purple box), supplier address (in the orange box),
and organizational providers (within the cyan box),
is crucial. This survey places particular emphasis on
table extraction, as indicated by data enclosed in blue
and red boxes. Additionally, it is worth noting that
the invoice contains other pertinent information that
may be valuable for Named Entity Recognition (NER)
processes, including the identification of both the sender
FIGURE 1. An invoice sample
and receiver. Figure 2 provides a comprehensive view of the typical content of an invoice by means of a UML class diagram.
Different types of information must be highlighted
such as addresses, tables, dates, and actors (organi-
zations or individuals identified on the invoice). This
selected information seems coherent with the analysis
of multiple invoice models and the usual requirements
of the companies. One may identify six groups of data:
- Actors: individuals or companies involved in the invoice, such as a customer or a supplier.
- Independent fields: fields whose value is not linked to one of the other fields and that often represent essential data for the invoice.
- Information on the document: information specific to the management of the document, such as its name or identifier in the file system and the dates of creation and processing of the document: all the data that are not extracted from the document but that come from its processing.
- Addresses: addresses contained in the document, with, where possible, their types: billing, delivery, or sender address, for example.
- Tables: data tables are essential in invoices. They
often include several lines of invoiced items, prices, quantities, etc.
- Dates: the set of dates specific to invoice processing, such as the date of issue of the invoice and the dates of payment or delivery.
FIGURE 2. A UML model for invoices
Among these data, tables are considered complex to
extract in this model because they often contain a large
amount of structured data that needs to be parsed
and understood. Companies need to perform verification
operations on the table data, such as verifying VAT
amounts and rates, or ensuring that the sum of the table
lines matches the invoice amount. Efficient methods
for extracting and analyzing table data are crucial due
to the time-consuming and error-prone nature of the
process.
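The verification operations described above can be sketched as follows; the field names and the VAT convention (a single rate applied to the pre-tax subtotal) are simplifying assumptions for illustration, not a normative invoice schema.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hedged sketch of the verification step: check that the sum of the extracted
# table lines, plus VAT, matches the grand total printed on the invoice.

@dataclass
class Line:
    description: str
    quantity: int
    unit_price: Decimal  # pre-tax price per unit

@dataclass
class Invoice:
    lines: list
    vat_rate: Decimal    # e.g. Decimal("0.20") for 20%
    total: Decimal       # grand total on the invoice, VAT included

def check_totals(inv: Invoice) -> bool:
    """Verify that the sum of the line amounts plus VAT matches the total."""
    subtotal = sum(l.quantity * l.unit_price for l in inv.lines)
    return subtotal * (1 + inv.vat_rate) == inv.total
```

Using `Decimal` rather than floating point avoids rounding surprises when comparing monetary amounts.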
III. INVOICE PROCESSING
As mentioned in the introduction, automated invoice
processing requires a complete chain of software tools
to automate the tasks involved in processing invoices.
Hence, we could consider the following key features:
1) Optical Character Recognition (OCR): OCR is
used to extract data from scanned or PDF invoices,
making them searchable and easily readable by the
system.
2) Machine Learning (ML): ML algorithms are widely used to classify and extract data from invoices, such as vendor information, invoice numbers, and amounts. They are also expected to extract structured information such as tables.
3) Workflow Automation: the system automatically routes invoices for approval, flagging any discrepancies or errors for manual review.
4) Integration with ERP: automated invoice process-
ing systems can integrate with enterprise resource
planning (ERP) systems, allowing for seamless
data transfer and real-time visibility into the in-
voice process.
5) Real-time Analytics: automated invoice process-
ing systems can provide real-time analytics and
reporting on invoice data, allowing businesses to
track and analyze their spending. This is strongly
related to business intelligence modules.
6) Compliance and Security: one may want to check
compliance with tax regulations, and protect sen-
sitive data through security measures such as
encryption and secure data storage.
In this overview, we restrict our scope to information extraction from raw scanned documents, and hence to the first two of the features listed above.
A. OPTICAL CHARACTER RECOGNITION
OCR systems have a long history, starting with early
mechanical devices that were developed in the 1950s,
such as GISMO (built by Sheppard in 1951). During
the 1960s and 1970s, not much research was done on
OCR due to the errors and slow recognition speed of
the early systems [72]. Over the past 40 years, however, there has been substantial research on OCR, which has led to the development of document image analysis and of multilingual, handwritten, and omni-font OCR systems. Nevertheless, OCR technology is still far from matching human reading abilities, and current research focuses on improving accuracy and speed across diverse document styles and languages, including complex scripts.
Let us mention several state-of-the-art reviews [37], [93] that were already synthesizing the work in the early 1990s. The seminal roots of OCR can be explored through the survey of Mantas et al. [83]. A good practical starting point is the work of Breuel et al. [18], which presents an open-source OCR solution. A more recent survey of OCR was published in 2017 by Islam et al. [59].
Hence, OCR is a crucial discipline in image interpretation, with highly important potential applications. A major problem was handwritten character recognition [89], which also required suitable databases. Note that major conferences have focused on OCR since the 1990s, e.g., ICDAR [1], with dedicated workshops [49]. Neural networks were then considered to overcome the earlier limitations. In [28], the use of projection profile features coupled with a back-propagation neural network classifier has proven highly effective. Nowadays, neural
networks are widely used in OCR technologies. Let us quote some recent works: in [96], the authors consider an extensive Urdu corpus well suited to applications involving deep learning techniques; [62] introduced end-to-end learning methods for recognizing arithmetic expressions, combining a deep convolutional neural network with a convolutional recurrent neural network; and in [66], the authors explore character recognition, encompassing both monolingual and multilingual contexts, using both deep and shallow architectures.
Among the impressive number of works related to OCR, let us mention the work of Mithe et al. [92], which presents a system that uses OCR to extract text and then sends it to a voice synthesizer. The main objective is to transform an image into speech for the text contained in the picture. This article shows that processing an image makes it possible to obtain fully structured information.
Of course, it is also very important to properly assess the performance of OCR using suitable measures and available benchmark sets [98]. Let us note here that image processing techniques can be used to obtain cleaner documents even before applying OCR. Morphological operations, such as dilation, erosion, and opening, are commonly used in image processing to remove noise, blur, and skew from document images. These techniques have been applied to prepare images for OCR and to locate text-containing parts of an image [147], for instance using OpenCV [44].
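As a minimal illustration of these morphological operations, the sketch below implements dilation and erosion on a tiny binary image in plain Python; production pipelines would instead apply a library such as OpenCV to full-page scans.

```python
# Hedged illustration of morphological dilation and erosion on a binary image
# represented as nested lists of 0/1 values, with a square structuring element
# of radius k (Chebyshev distance).

def dilate(img, k=1):
    """Set a pixel if any neighbour within distance k is set."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(any(
                img[ny][nx]
                for ny in range(max(0, y - k), min(h, y + k + 1))
                for nx in range(max(0, x - k), min(w, x + k + 1))
            ))
    return out

def erode(img, k=1):
    """Keep a pixel only if all neighbours within distance k are set."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(all(
                img[ny][nx]
                for ny in range(max(0, y - k), min(h, y + k + 1))
                for nx in range(max(0, x - k), min(w, x + k + 1))
            ))
    return out
```

Opening (erosion then dilation) removes isolated noise pixels, while closing (dilation then erosion) fills small gaps in strokes; both help before OCR and text localization.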
Returning to our concern with structured information extraction, a dedicated challenge was recently proposed by Huang et al. [58] at the ICDAR 2019 conference. The best paper prize was awarded to Zhong et al. [158], who offer a solution based on neural networks for the recognition of certain entities related to the formatting of documents.
In recent times, there has been ongoing research in the field of OCR. Let us mention a first work [10] that specifically focuses on the application of OCR to the recognition of written texts in a medical context. A promising development in OCR techniques aligns with the progress in deep learning, as exemplified by the work of Li et al. [77]. In this work, the authors adapt the transformer architecture to address OCR challenges and present a comprehensive benchmark featuring many contemporary techniques. This reflects the dynamic evolution of OCR methodologies, where advances in deep learning play a pivotal role.
B. DATA EXTRACTION
Once OCR has been applied, we are generally left with a set of PDF documents that are expected to be searchable and exploitable. Let us first begin with a general consideration of the data that can be extracted at this stage. At first glance, we may consider the visual aspect of the document and the relative positions of the information it contains.
The work of Suzanne Liebowitz Taylor et al. [133] presents an overview of the problem of extracting information from scanned documents. This article highlights the problems of text alignment and points out that only part of the information is relevant to extract.
The global layout of the document has to be taken
into account [7]. Ahmad et al. [2] use the concept of
unstructured, semi-structured, or structured documents.
The work of Yao et al. [146] on the relationships between entities, which also deals with unlabeled data, seems very relevant. Sun et al. [131] present a solution for orienting
documents according to a specific entity (QR Code
in the article). These methods address two common
challenges in data extraction: document orientation and
scale. The invoices, which are in the form of images,
are first preprocessed to remove any unnecessary back-
ground and to correct the angle of the invoice. Then, the
region containing the desired information on the invoice
is identified using template matching. Another system
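A toy version of this template-matching step might look as follows, locating a small template inside a grayscale image by minimizing the sum of absolute differences (SAD); real systems typically use normalized cross-correlation on full scans, and the data layout here is an illustrative assumption.

```python
# Hedged toy template matching: slide the template over every valid position
# in the image (nested lists of grayscale values) and keep the placement with
# the smallest sum of absolute differences.

def match_template(image, template):
    """Return (row, col) of the best-matching placement of template in image."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best, best_pos = None, None
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            sad = sum(
                abs(image[y + dy][x + dx] - template[dy][dx])
                for dy in range(th)
                for dx in range(tw)
            )
            if best is None or sad < best:
                best, best_pos = sad, (y, x)
    return best_pos
```

In an invoice pipeline, the template would be a known anchor (a logo or a QR code region) whose position then determines orientation and scale.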
(BINYAS) [16] performs document layout analysis for
document image processing. This system uses connected
components and pixel analysis for classifying elements
such as paragraphs, graphics, images, and tables in
the document. In [11] the authors propose a dataset
for unstructured invoice documents that covers a wide
range of layouts, which is designed to generalize key
field extraction tasks for unstructured documents. The
dataset is evaluated using various feature extraction
techniques as well as Artificial Intelligence methods.
As already mentioned, tabular content extraction
from PDF documents is of great importance, in partic-
ular for benefiting from available open-source document
repositories [30]. The extraction and processing of data
from PDF files have indeed always been studied [81].
Data in such documents is often displayed in a tabular format.
Although tables may appear simple, extracting and
processing them from PDFs can be difficult and require
complex computational methods [48]. The purpose is
often to produce new formats from initial PDFs such
as XML files [113]. Note that PDFs do not typically
record the structure of their graphical objects in their
description, although it could be done.
Of course, visual separators are important for identifying tables in documents, as they reveal the table structures [41]. When tables include visible lines that can be extracted from the document, considering the maximum independent set of rectangles (MISR) problem seems relevant [24]: MISR consists of finding, within a set of rectangles, a largest subset of pairwise non-intersecting rectangles. Unfortunately, many tables lack lines separating some columns or rows, and such techniques do not apply in these cases. Yildiz et al. [149] present approaches based on line intervals and columns to identify the entities corresponding to table cells. Note that table extraction will be detailed in Section III-D.
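For the small rectangle sets found on a single page, MISR can be sketched by exhaustive search; the representation of rectangles as (x1, y1, x2, y2) tuples is an assumption for illustration, and the general problem does not admit an efficient exact algorithm.

```python
from itertools import combinations

# Hedged brute-force sketch of MISR: among axis-aligned rectangles
# (x1, y1, x2, y2), find a largest subset of pairwise non-intersecting ones.

def intersects(a, b):
    """True if rectangles a and b overlap (touching edges do not count)."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def max_independent_rectangles(rects):
    # Try subsets from largest to smallest; the first independent one wins.
    for size in range(len(rects), 0, -1):
        for subset in combinations(rects, size):
            if all(not intersects(a, b) for a, b in combinations(subset, 2)):
                return list(subset)
    return []
```

The selected rectangles can then be interpreted as candidate table cells.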
Deep learning techniques are now widely used to
identify and extract tables in PDF documents [46], [152].
This aspect will be detailed later. Note that some work
uses libraries such as PDFMiner to transform PDF into XML and perform supervised learning on the XML [104].
C. ADDRESSING SPECIFIC INFORMATION EXTRACTION:
NAMED ENTITY RECOGNITION
In the scope of this study, we are not concerned with general document processing but with invoices, which are restricted to a specific domain whose terms and concepts are known. Hence, we are concerned with the semantics of the documents. The analysis of invoices is thus related to Natural Language Processing (NLP) and, more specifically, to Named Entity Recognition (NER) (see [95], [145] for dedicated surveys).
The problem of named entity recognition (NER) was
presented by Marsh and Perzanowski at the MUC con-
ference [85]. NER involves labeling a text by associating
each character string with a specific category, such as
a person, location, organization, temporality, amount,
or percentage. This problem is also referred to as entity
labeling or entity extraction. Research then intensified on this topic. CoNLL-2003 [135] focused on language-independent named entity recognition. The challenge concentrated on four types of named entities: persons, locations, organizations, and names of miscellaneous entities that do not belong to the previous three groups. During the same period, the goal of the ACE program [35] was to advance technology for automatically extracting information from human language data. This includes identifying mentioned entities, determining the relationships between these entities as expressed in the text, and recognizing the events in which these entities are involved. The program encompasses various data sources.
At that time, NER was restricted to the names of people, locations, and organizations, and sometimes to some other proper names, which does not cover all the types expected in an invoice.
The specific set of labels used in NER depends on
the data and the task at hand. NER is, of course,
strongly dependent on the application domain (e.g.,
[107], [154]). Some researchers limit themselves to the 6
initial categories (person, location, organization, tempo-
rality, amount, and percentage) and believe that these
labels are sufficient for all NER tasks. However, other
researchers argue that specific labels may be necessary to
effectively solve specific NER tasks [20]. The number of
labels used can vary depending on the complexity of the
data and the specific requirements of the task. Therefore,
the choice of labels is often a trade-off between the need
for specific information and the complexity of the model.
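As an illustration of such a task-specific label set, the sketch below decodes entities from tokens tagged with the common BIO scheme (B- begins an entity, I- continues it, O is outside); the invoice-oriented label inventory is an illustrative assumption, not a standard.

```python
# Hedged illustration of an invoice-specific NER label set with BIO tags.
# The labels below (INVOICE_NUM, DATE, AMOUNT, ORG) are illustrative.

def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (label, text) entities."""
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(token)
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(label, " ".join(words)) for label, words in entities]
```

A model trained on invoice data would produce the tag sequence; decoding then yields typed spans such as invoice numbers and organization names.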
Let us particularly mention the works of Alfonseca et
al. [3] and R. Evans [40] that use the notion of “open
domain”. Recently, data sets have been made available
for NER related to invoices [11]. From a practical point
of view, Mikolov et al. [90] demonstrate the benefit of
using vector representation of words and also that it
is possible to train a model of neural networks on a
large training set, including a large number of sentences
with approximately one billion words and a vocabulary
of more than one million different words. A month
later, Mikolov et al. [91] considered a distributed rep-
resentation of words and prove that, by adding certain
vectors of words, the learning process allows one to
learn the meaning of the words. The linguist Scharolta
Katharina Siencnik [126] attempts then to demonstrate
the possible application of these algorithms to named
entity recognition.
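The word-vector arithmetic mentioned above can be illustrated with a toy example; the vectors below are hand-crafted for the demonstration, whereas real embeddings are learned from large corpora.

```python
import math

# Hedged toy demonstration of word-vector arithmetic: the vocabulary and
# vector values are illustrative assumptions, not trained embeddings.
VECS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def analogy(a, b, c):
    """Return the vocabulary word closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(VECS[a], VECS[b], VECS[c])]
    return max((w for w in VECS if w not in (a, b, c)),
               key=lambda w: cosine(VECS[w], target))
```

With these toy vectors, `analogy("king", "man", "woman")` recovers "queen", the classic example of additive vector semantics.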
While state-of-the-art named entity recognition sys-
tems relied heavily on hand-crafted features and domain-
specific knowledge, new neural architectures for NER
were proposed [27], [73]. These architectures aim to
improve performance by leveraging the strengths of
neural networks, such as their ability to learn useful
features from data, while still addressing some of the
limitations of previous methods. Convolutional neural networks (CNNs) [79] were then applied to NER problems [139], [151], as well as bidirectional networks [4]. Let us mention work on the identification of depression from patients' answers in interviews [115], as well as the work of He et al. [54] on establishing distant dependencies between entity terms via CNN processing.
ELMo is a language model developed by Matthew E.
Peters and his team [105]. Unlike traditional word
embeddings that represent words as fixed vectors, ELMo
uses the context in which words appear to generate more
dynamic and informative word embeddings. The model is
only shallowly bidirectional: it combines a forward and a
backward language model, so both the preceding and the
succeeding words in a sentence inform the representation
of the word being encoded.
ELMo's innovative approach to word embeddings
quickly gained attention from researchers in the natural
language processing (NLP) community. Dogan et al. [36]
applied ELMo's neural network architecture to Named
Entity Recognition (NER) problems, which involve
identifying and classifying entities in text such as names,
dates, and locations. While ELMo showed promising
results, it has a limitation: it cannot be deeply fine-tuned
together with downstream models in the way later
approaches trained with a "masked language model"
objective can.
To address this shortcoming, Devlin et al. proposed
BERT [34], a deeply bidirectional Transformer model
that builds on ELMo's idea of contextual embeddings
and has since become one of the most widely used
pre-training models for NLP tasks. BERT is pre-trained
with a "masked language model" objective: tokens are
hidden at random and the model learns to recover them
from both their left and right context, in an unsupervised
fashion. This allows BERT to produce high-quality word
representations that can be fine-tuned together with
task-specific layers to achieve state-of-the-art performance
on a wide range of language tasks.
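The "masked language model" objective can be sketched in a few lines; the masking routine below is a simplified stand-in for BERT's actual procedure (which additionally replaces some selected tokens with random words or leaves them unchanged):

```python
import random

def mask_tokens(tokens, mask_prob=0.3, mask_token="[MASK]", seed=13):
    """BERT-style masking: hide a fraction of tokens at random; during
    pre-training the model must recover them from the surrounding context."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok  # what the model should predict at position i
        else:
            masked.append(tok)
    return masked, targets

sentence = "the total amount due is 42 euros".split()
masked, targets = mask_tokens(sentence)
print(masked, targets)
```

The model's loss is then computed only at the masked positions, which is what lets both left and right context be used without the model trivially "seeing" the answer.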
Ali Safaya et al. [116] demonstrate a possible association
between CNN and BERT and study its efficiency.
VOLUME 10, 2022
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3360528
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Saout et al.: An Overview of Data Extraction from Invoices
Their work focuses on BERT for Arabic, Turkish, and
Greek, languages whose construction is more structured
than some others, and achieves better results in the
recognition of hateful content for these languages.
GPT models have become essential in natural language
processing (NLP) due to their ability to be fine-tuned for
specific NLP tasks. Radford's work on language-model
transformers, particularly the GPT model [111], has had
a profound influence on the field. Unlike bidirectional
models such as BERT and ELMo, GPT is a unidirectional
model: each representation attends only to the preceding
context, typically from left to right. This architecture
makes GPT particularly well suited to language prediction
tasks, where the model predicts the next word of a
sentence from the words before it.
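This left-to-right prediction objective can be illustrated with a toy bigram model standing in for the Transformer (the corpus and counts below are invented for illustration):

```python
from collections import Counter, defaultdict

# Toy corpus of invoice-like phrases (purely illustrative).
corpus = "total amount due total amount payable total price due".split()

# Count bigrams: how often does each word follow another?
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    # Greedy left-to-right prediction: most frequent successor.
    return following[word].most_common(1)[0][0]

print(predict_next("total"))  # amount
```

GPT replaces the bigram table with a deep attention-based network, but the training signal is the same: given the prefix, score the next token.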
Returning to our main concern, Francis et al. [43]
present a solution for extracting data from financial
or medical documents using a neural network trained
for named entity recognition, and compare the efficiency
of character-based and word-based models. General
language processing must also be considered: for instance,
the work of Suárez et al. [130] on the state of the art of
named entity recognition for French can be useful when
dealing with French invoices. Hamdi et al. [52] present
tools that improve the learning of invoice-specific labeling
while reducing time and human intervention.
To ensure better explainability, rule-based approaches
remain useful alternative techniques for NER [26].
Shreeshiv Patel et al. [103] address key invoice parameter
extraction (KPIE), proposing both a rule-based approach
and a neural-network approach to recognize these
parameters. Declarative approaches based on constraint
solving should also be considered a promising research
direction [5].
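As a minimal illustration of the rule-based style (the patterns below are our own toy rules, not those of [103]):

```python
import re

# Minimal rule-based extractor for two invoice fields.
# The regular expressions are illustrative, not production-grade.
RULES = {
    "date":   re.compile(r"\b(\d{2}/\d{2}/\d{4})\b"),
    "amount": re.compile(r"\b(\d+(?:[.,]\d{2}))\s?(?:EUR|USD)"),
}

def extract(text):
    """Apply every rule to the text; return all matches per field."""
    return {field: rx.findall(text) for field, rx in RULES.items()}

invoice = "Invoice 2024-17, issued 05/03/2024. Total due: 1299,00 EUR by 04/04/2024."
print(extract(invoice))  # {'date': ['05/03/2024', '04/04/2024'], 'amount': ['1299,00']}
```

The strengths and weaknesses discussed above are visible even here: the rules are transparent and precise, but each new date format or currency notation requires another hand-written pattern.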
Practical NER solutions are available, such as those of
Nanonets3 and ABBYY. A well-documented example
explains the use of BERT for NER [34].
In summary, there are two main approaches. The
historical rule-based approach draws on the rules of
traditional grammar to label words in their textual
context. It is very efficient in specific domains, because
the rules are usually written for the target domain so as
to avoid ambiguity. However, this specialization makes it
hard to handle new contexts that were not anticipated
during implementation: extending the model's capacity
requires reworking the rules, a step that often demands
the intervention of an expert.
3https://www.nanonets
The neural-network approach to labeling the entities of
a document is attractive because it avoids spending too
much time defining labeling rules by hand. It copes
better with new domains, and relearning for newly
treated concepts can more easily be automated.
Nevertheless, neural networks require large computational
resources and training corpora to be efficient.
In Figure 3 we propose an empirical evaluation of NER
systems, based on the state of the art, statements from
specialists in this field, and the needs encountered in
companies. This evaluation is therefore partly subjective.
FIGURE 3. Advantages and disadvantages of NER methods according to
the state of the art
D. FOCUS ON TABLE EXTRACTION
A closer examination of invoices shows that most of
them include tables as a main structural feature. Table
detection within invoices therefore appears to be an
important processing task [122]. Table processing is
indeed an old challenge (a 2004 survey [150] already
proposed an overview of the field), yet these challenges
remain active [45].
Understanding information embedded into tables in-
volves three steps as quoted in [61]: detecting the table
boundaries, identifying the structure of the table includ-
ing rows, columns, and cell positions, and recognizing
the contents of the table (tokens of information that are
expected to be presented in a more readable format).
The layout is an important aspect [69]. Techniques used
for detection include object detection models [23] like
Faster-RCNN (Region Based Convolutional Neural Net-
works) and Mask-RCNN [108] and NLP-based methods
that incorporate both textual and visual features [57].
Note that TableBank [76] includes a new image-based
table detection and recognition dataset. PubLayNet
[158] can accurately recognize the layout of scientific
articles after training on over one million PDF articles.
LayoutLMv3 [57] is pre-trained with a word-patch align-
ment objective to improve cross-modal alignment. This
allows the model to predict whether the image patch
associated with a text word has been masked.
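Detection approaches such as these are typically evaluated by the intersection-over-union (IoU) between predicted and ground-truth table boxes; a minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # ≈ 0.333
```

A detection is usually counted as correct when its IoU with a ground-truth table exceeds a threshold (0.5 being a common choice in the competitions cited above).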
Deep learning techniques are now widely used for
achieving table structure recognition. Recently, Kavasidis
et al. [67] introduced a fully-convolutional neural
network that utilizes saliency-based techniques for
multi-scale reasoning with visual cues. They also in-
corporate a fully-connected conditional random field
to precisely locate tables and charts within digital or
digitized documents. A common approach consists of
using a bi-directional RNN with Gated Recurrent Units
(GRUs) to process image data [68]. A pre-processing
step formats the image data so that it can be fed
into the network. The bi-directional RNN with GRUs
is then used to analyze the image data and extract
features. Finally, a fully connected layer with a softmax
activation function is used to classify the image based
on the features extracted by the RNN. Gilani et al. [47]
introduced a deep-learning approach to table detection:
document images are first pre-processed and then fed
into a Region Proposal Network (RPN), followed by a
fully connected neural network that identifies tables.
Their method demonstrates remarkable precision on
document images with diverse layouts, encompassing
documents, research papers, and magazines. Vine et
al. [137] introduce a
two-step approach including a generative adversarial
network (GAN) and a genetic algorithm to optimize
a distance measure between candidate table structures.
Another two-step process that uses cell detection and
interaction modules to recognize the structure of a
table is proposed in [112]. The cell detection module
is used to locate and identify individual cells in the
table image. The interaction module then predicts the
associations between the detected cells, such as their
row and column associations. This approach can be
useful for determining the overall structure of a table,
including the number of rows and columns, as well
as the relationships between cells within the table.
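A simple coordinate-based heuristic conveys the idea of recovering row associations from detected cell boxes (a hand-written stand-in for the learned interaction module, with boxes given as (x1, y1, x2, y2) and invented coordinates):

```python
def group_rows(cells, tol=5):
    """Cluster detected cell boxes into rows by their top edge (y1):
    cells whose tops are within `tol` pixels share a row."""
    rows = []
    for cell in sorted(cells, key=lambda c: c[1]):  # sort by y1
        if rows and abs(cell[1] - rows[-1][0][1]) <= tol:
            rows[-1].append(cell)
        else:
            rows.append([cell])
    return [sorted(r) for r in rows]  # left-to-right within each row

cells = [(120, 10, 200, 30), (10, 12, 100, 30),
         (10, 50, 100, 70), (120, 52, 200, 70)]
print(group_rows(cells))
```

Learned interaction modules replace the fixed tolerance with predicted pairwise associations, which is what lets them cope with skewed scans and irregular layouts.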
Convolutional networks have, of course, also been
explored [67], [125], together with Split-and-Merge
models [134]. In [99], the authors additionally consider
explainability as an issue in neural networks. Global
end-to-end solutions are now available, such as TableNet
[100], DeepDeSRT [120], PubTabNet [156], and GTE
[155]. Dedicated benchmark repositories have been
proposed to evaluate these methods: TableBank [76]
(417K high-quality labeled tables) and a novel dataset
derived from the RVL-CDIP invoice data [114].
Table detection may also rely on more specific
knowledge. In [140], the authors propose a system for
automatically generating ground-truth data for training
table detection algorithms. The literature also contains
important work on layouts: for example, in [106], David
Pinto et al. use Conditional Random Fields (CRFs) to
model the different layouts of a table, which can
sometimes overlap and may be misinterpreted by other
modeling languages. Tools such as TableSeer [80] search
for shapes that may correspond to tables in order to
extract them and allow queries over their contents.
The specific structure of invoices leads to considering
the geographical organization of the document, and
graph-based models are thus relevant [122]. Recent work
[65] proposes an approach to detect the general frame of
a table and extract its content. For more specific tables,
characteristic elements such as headers can also support
these tasks [121]. Rule-based systems, which were among
the seminal table extraction techniques, may also remain
relevant [124].
Graph-based approaches also seem a natural way to
handle tables. In [117], the authors use graph mining to
extract tables using key fields. Graph Neural Networks
(GNNs) [119] thus appear well suited to handling
graph-based knowledge [148]; they can indeed capture
the locally repeating structural information in invoice
document tables [114]. In [78], the authors propose a
GNN-based method that mixes position and text; their
algorithm also uses visual recognition to predict the
correct numbers of columns and rows. In [109], an
architecture is introduced that combines the benefits
of convolutional neural networks for visual feature
extraction with graph networks for dealing with the
problem structure. Cell detection and cell logic are used
to predict the location of cells in [144]. [153] presents a
unified framework that combines vision, semantics, and
relations for analyzing document layouts, supporting
both natural language processing and computer
vision-based methods. Slightly differently, LGPMA [110]
employs a soft pyramid mask learning approach to
recover table structure by analyzing both local and
global feature maps, additionally taking the location of
empty cells into account.
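The input to such GNN-based recognizers is typically a spatial graph over OCR tokens; a minimal sketch of a k-nearest-neighbour construction (token positions are invented for illustration):

```python
import math

# Hypothetical tokens: (text, x, y) centre positions on the page.
tokens = [("Qty", 50, 100), ("Price", 150, 100), ("2", 50, 130), ("9.99", 150, 130)]

def knn_graph(tokens, k=2):
    """Connect each token to its k spatially nearest neighbours: the usual
    input graph for GNN-based table recognizers."""
    edges = set()
    for i, (_, xi, yi) in enumerate(tokens):
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (_, xj, yj) in enumerate(tokens) if j != i)
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))  # undirected edge
    return sorted(edges)

print(knn_graph(tokens))  # [(0, 1), (0, 2), (1, 3), (2, 3)]
```

The GNN then classifies nodes (header, value, etc.) and edges (same row, same column), from which the table structure is reassembled.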
E. HANDLING GEOGRAPHIC INFORMATION IN THE
INVOICES: POSSIBLE PERSPECTIVES FOR GRAPH-BASED
REPRESENTATIONS
Since the layout of invoices is particularly relevant as
described above, let us explore the modeling and the
processing of geometric or geographic information, to
discover links that cannot be handled by a purely seman-
tic analysis of the document. For example, an invoice
may contain a keyword and its expected associated value
close to it. Let us review some methods for representing
and exploring this structured data. For instance, Esser
et al. [39] try to extract templates from scanned documents.
This section is devoted to methods that do not rely on
image processing or neural networks trained to capture
the global layout of the document. We are instead
interested in representation models and the associated
solving techniques, which process geometric data in a
more frugal (no huge and costly training phase) and
more declarative way.
A long time ago, Cesarini et al. [21] were already
interested in the structural analysis of a document by
trying to label areas. They consider that an invoice is a
set of regions that can be identified using their relative
geometrical position.
As mentioned in Section III-D, graph-based represen-
tation has been explored for handling the structures of
the tables in documents. Therefore, we focus here on
such representations and how they can be exploited to
efficiently retrieve table structures and their content.
Since the structure of a table may contain different
levels, we argue that several levels of abstraction are
needed to represent the geometrical structure of a table.
Using models with geometric constraints and enabling
their declarative handling has been explored in [19].
An abstract model is linked to a graphic model and a
refinement process is proposed. Geometric constraints
[94] require dedicated constraint solvers according to
targeted domains. In [118], we propose a hypergraph-based
approach to table extraction. Hypergraphs [15] are classic
extensions of graphs that enable more expressive models.
Hence, after suitable modeling, table extraction in a
document can be cast as an isomorphism problem in
hypergraphs [14]. The subgraph isomorphism problem is
NP-complete [29], and its complexity has been refined
according to various parameters [86]. Solvers such as the
Glasgow solver [87] are available for this problem, as are
efficient algorithms [128], including recent quantum
search algorithms [84]. In a recent work [74], the authors
propose to represent tables as planar graphs with cell
regions as their faces; they generate junction confidence
maps and line fields using heatmap regression networks,
mixing deep neural networks with constrained
optimization.
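A brute-force check conveys the underlying combinatorial problem (exponential in the pattern size, which is precisely why dedicated solvers such as Glasgow [87] matter in practice):

```python
from itertools import permutations

def sub_isomorphic(pattern, target, n_p, n_t):
    """Naive (non-induced) subgraph-isomorphism test over edge sets.
    Exponential: fine for tiny patterns, hopeless at scale."""
    for mapping in permutations(range(n_t), n_p):
        if all((mapping[u], mapping[v]) in target
               or (mapping[v], mapping[u]) in target
               for u, v in pattern):
            return True
    return False

# Pattern: a small "table frame" path 0-1-2.
# Target: a 4-cycle with one chord, standing in for a document graph.
pattern = {(0, 1), (1, 2)}
target = {(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)}
print(sub_isomorphic(pattern, target, 3, 4))  # True
```

Extending the same idea from graphs to hypergraphs (edges covering more than two vertices) gives the formulation used in [118].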
F. TURNING TO EFFICIENT SOLUTIONS FOR INDUSTRY
As a starting point, it is worth examining Extraction,
Transformation, and Loading (ETL) processes [136],
which form the backbone of data warehouse architectures
and aim at acquiring data from diverse, potentially
multimodal document sources. A critical dimension in
this context is that data assimilation stems from a variety
of document origins, and the multifaceted nature of these
documents underscores the complexity of the task at
hand. Furthermore, automated document processing
systems must be able to update data at regular intervals,
emphasizing the need for real-time adaptability. Following
these lines, Figure 4 encapsulates the multifunctional
essence of information extraction from invoices, providing
a visual representation of the intricate multitasking
inherent in the extraction workflow.
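The three ETL stages can be sketched as plain functions (a toy pipeline with invented data; real systems add validation, scheduling, and incremental loading):

```python
# Minimal ETL sketch: each stage is a pure function, chained end to end.
def extract(raw_docs):
    # Pull text out of heterogeneous sources (here: already-OCRed strings).
    return [d.strip() for d in raw_docs]

def transform(texts):
    # Normalize into records; a stand-in for NER / table extraction.
    return [{"text": t, "n_words": len(t.split())} for t in texts]

def load(records, warehouse):
    # Append to the warehouse (here: a plain list).
    warehouse.extend(records)
    return warehouse

warehouse = []
docs = ["  Invoice 42: total 99.00 EUR ", "Receipt: 3 items"]
load(transform(extract(docs)), warehouse)
print(warehouse[0]["n_words"])  # 5
```

Keeping the stages separate is what lets an invoice pipeline swap, say, one OCR engine or one NER model without touching the rest of the workflow.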
Some industry solutions address parts of the total ETL
process. They are based on plugins designed for each
information retrieval task. For instance, the
FIGURE 4. Different steps in an ETL process
Azure4 solution developed by Microsoft offers numerous
APIs for processing documents, including OCR and
NER. The ABBYY solution is split into different
programs: FlexiCapture for OCR and FlexiLayout for
extracting data from a document using templates.
Transformers may now be used to provide end-to-end
solutions and address the various modalities involved in
document processing tasks, such as classification, question
answering, or NER [32], [70]. The diverse nature of
documents necessitates multimodal reasoning over various
types of inputs [8], including the visual, textual, and
layout elements found in a variety of document sources.
These aspects should be considered when developing
efficient invoice processing tools.
IV. CONCLUSION
In conclusion, invoices are crucial documents for compa-
nies as they serve as proof of purchase and are necessary
for accounting and tax purposes. The processing of
invoices can be time-consuming and prone to errors,
but recent advances in technology have led to the de-
velopment of systems that automate the process. These
systems use a combination of OCR, NLP, and machine
learning techniques to digitize paper invoices and extract
relevant information. The processing of invoices involves
different steps such as document digitization, informa-
tion extraction, and data validation, and specific work-
flows are often used to ensure efficiency and accuracy.
The challenge of processing invoices lies in handling the
variability of layouts, language, and terminology, and
the presence of errors or inaccuracies in the data.
In this survey, we have reviewed the essential com-
ponents that must be taken into account when de-
veloping an automated invoice processing system. Our
goal is to provide valuable insights to researchers and
engineers striving to create end-to-end solutions, and
in this pursuit, several critical factors demand careful
consideration:
Document Quality: The quality of the documents
input for processing plays a crucial role. Standard
digitized invoices can often be handled with rela-
tively basic OCR systems. However, when dealing
4https://azure.microsoft.com/en-us
with documents exhibiting orientation issues or
containing handwritten sections, a more sophisti-
cated image processing pipeline and highly efficient
text recognition are imperative. Real-world finan-
cial documents, for instance, may feature handwrit-
ten notes from employees seeking reimbursements,
making document quality a critical determinant.
Invoice Content: The nature of the invoice content
is another crucial consideration. In cases where
invoices consist of limited and concise information,
without extensive descriptions or intricate commer-
cial terms, employing simple Named Entity Recog-
nition (NER) techniques based on a compact model,
as exemplified in Figure 2, suffices. Conversely, for
more complex scenarios, the integration of Natural
Language Processing (NLP) techniques becomes
essential to delve into the semantic nuances of
scanned texts.
Layout Diversity: The diversity of invoice layouts
cannot be underestimated. When documents are
associated with a finite number of suppliers or
clients, rule-based techniques designed to match
predefined layouts can be harnessed. Moreover,
these techniques may offer flexibility, allowing end-
users to fine-tune the system to visually locate and
extract key information from invoices.
Annotated Data Sets: Machine learning techniques,
while powerful, rely heavily on sizable and repre-
sentative training datasets for optimal performance.
As mentioned in this survey, rule-based approaches
can often be generic enough to process invoices ef-
fectively without necessitating extensive supervised
learning processes.
Table Diversity and Quality: Tables within invoices
represent a pivotal aspect of the processing pipeline.
While basic tables can be detected using image
processing and neural network-based algorithms,
more complex scenarios emerge when tables are
incomplete and exhibit considerable diversity, often
due to variations in invoice layouts. In such cases,
recent graph-based algorithms present a compelling
and efficient alternative.
By taking these facets into account, engineers can
embark on the development of robust, efficient, and
adaptable automated invoice processing systems that
cater to a wide spectrum of real-world invoice scenarios.
In this context, hybrid methods combining rule-based
and neural network approaches appear particularly promising.
In recent times, large language models (LLMs) have
emerged as a notable development. These models offer
promising prospects for document processing by
integrating structural and semantic recognition to achieve
effective extraction of information from both structured
and semi-structured documents.
V. APPENDIX: BIBLIOGRAPHIC TABLES
Main topic                 References
OCR techniques             [37], [59], [66], [83], [93], [98], [10], [77]
Text detection techniques  [16], [147]
NER approaches             [75], [95], [107], [145], [154]
Table processing           [30], [38], [64], [150]
Convolutional networks     [79]
Invoice processing         [52]
Graph neural networks      [148]
Information retrieval      [51]
TABLE 1. Summary of cited surveys
Name              Desc.                                                           Ref.
CORD              Receipt dataset for post-OCR parsing                            [101]
rvl-cdip-invoice  Set of invoices extracted from RVL-CDIP                         [53]
GHEGA-DATASET     Labeled dataset for document understanding research experiments [88]
ICDAR2019         Competition on scanned receipt OCR and information extraction   [58]
FUNSD             Form Understanding in Noisy Scanned Documents challenge         [60]
PubLayNet         Dataset for document layout analysis                            [158]
TABLE 2. Summary of available datasets for document analysis and recognition
Name          Desc.                                                               Ref.
cTDaR         Annotated documents with table entities                             [45]
SciTSR        Large-scale table structure recognition dataset                     [25]
PubTabNet     Image-based table recognition                                       [157]
WikiTableSet  Publicly available image-based table recognition dataset in         [82]
              three languages, built from Wikipedia
TABLE 3. Summary of available datasets for table analysis
Reference  Topic
[18]       Open-source OCR solution
[89]       Handwritten character recognition
[28]       Handwritten OCR
[96]       Text recognition using deep learning
[62]       Deep-learning-based OCR
[92]       OCR solution including image-to-speech transformation
[98]       Benchmark sets for OCR
[44]       OpenCV system
[158]      Neural-network-based OCR
[58]       Description of the ICDAR2019 competition on scanned receipts
[10]       A survey of OCR specialized for medical reports
[77]       A transformer-based technique for OCR, with a benchmark against modern solutions
TABLE 4. Summary of main cited works on OCR
Reference  Topic
[133]      Seminal work on data extraction
[7]        Computational-geometry algorithms for analyzing document structures
[2]        Handling multiple types of data structures
[146]      Considering relations between data
[131]      Orientation of documents
[16]       Document layout analysis
[11]       Data sets for evaluation
[81]       Seminal work on PDF document management
[48]       Data extraction from tables
[113]      Table extraction for PDF documents
[41]       Table detection for multipage PDF documents
[24]       Solving the maximum independent set of rectangles problem
[149]      pdf2table: a method for extracting tables
[46]       Graph neural network for extracting tables from PDF documents
[152]      Deep learning for PDF table extraction
[104]      Presentation of TAO for table detection and extraction
TABLE 5. Summary of main cited works on data extraction
Reference  Topic
[85]       Seminal work on NER
[135]      NER challenge at CoNLL
[35]       ACE program: challenge for NER systems
[20]       Empirical study of NER
[3]        Procedure to automatically extend an ontology with domain-specific knowledge
[40]       System for NER in the open domain
[90]       Model architectures for computing continuous vector representations of words (word2vec)
[91]       Distributed architectures for word2vec
[126]      Adaptation of word2vec to NER
[73]       Neural networks for NER
[27]       Neural networks for NER
[4]        Bidirectional recurrent neural network for NER
[34]       Presentation of BERT
[54]       Combination of a convolutional neural network with BERT
[115]      Application of BERT-CNN in health care
[105]      Presentation of ELMo, a language-model word representation
[36]       Use of ELMo for NER
[116]      BERT-CNN for speech identification
[111]      Enhancing language comprehension through pre-training
[43]       Data extraction from financial documents
[130]      State of the art of NER for the French language
[52]       Specific work on invoices
[26]       Rule-based information extraction systems
[103]      Information extraction from scanned invoices
[5]        Constraint satisfaction for invoice processing
ABBYY      A commercial system for NER
TABLE 6. Summary of main cited works on NER
Ref.   Topic
[122]  Reference work on table extraction
[45]   ICDAR 2019 Competition on Table Detection and Recognition
[69]   The T-Recs system for table recognition
[23]   Algorithm for searching parallel lines in documents to extract tables
[61]   Proposal for the representation of tables (Wang Notation Tool)
[158]  PubLayNet, a data bank for table extraction
[76]   TableBank, a data bank for table extraction
[108]  Presentation of CascadeTabNet, an end-to-end system using convolutional neural networks
[57]   LayoutLMv3: a general-purpose pre-trained model for documents
[67]   Convolutional neural network for table detection
[68]   Approach based on bi-directional gated recurrent unit networks
[47]   Deep learning for table detection
[137]  Approach based on a generative adversarial network
[112]  Two-step approach combining cell detection and an interaction module
[125]  DeepTabStR: a deep-learning-based system for table recognition
[134]  Use of novel deep learning models (Split and Merge models)
[99]   Explainability for obtaining the semantic structure of tables
[100]  TableNet: end-to-end solution for table extraction
[120]  DeepDeSRT: end-to-end solution for table extraction
[156]  PubTabNet: end-to-end solution for table extraction
[155]  GTE: end-to-end solution for table extraction
[114]  Use of graph neural networks for table extraction
[140]  System for automatically generating ground-truth data for training table detection algorithms
[106]  Introduction of conditional random fields to manage table layouts
[80]   Presentation of TableSeer, a search engine for tables
[65]   An end-to-end table structure recognition system using a YOLO-based object detector
[121]  Segmentation techniques for tables
[124]  Presentation of TabbyPDF: heuristic-based approach to table detection and structure recognition
[117]  Approach using a graph-based representation of documents
[109]  Architecture combining convolutional neural networks and graph networks
[144]  Presentation of TGRNet, an end-to-end trainable table graph reconstruction network
[153]  Presentation of VSR, a combination of computer vision and NLP techniques
[110]  LGPMA, a system using Local and Global Pyramid Mask Alignment
TABLE 7. Summary of main cited works on table extraction
REFERENCES
[1] ICDAR 2nd International Conference Document Analysis
and Recognition. IEEE Computer Society, 1993.
[2] Ily Amalina Sabri Ahmad and Mustafa Man. Multiple types
of semi-structured data extraction using wrapper for extrac-
tion of image using dom (weid). In Regional Conference on
Science, Technology and Social Sciences (RCSTSS 2016),
pages 67–76. Springer, 2018.
[3] Enrique Alfonseca and Suresh Manandhar. An unsupervised
method for general named entity recognition and automated
concept discovery. In Proceedings of the 1st international
conference on general WordNet, 2002.
[4] Mohammed N. A. Ali, Guanzheng Tan, and Aamir Hussain.
Bidirectional recurrent neural network approach for arabic
named entity recognition. Future Internet, 10(12):123, 2018.
[5] Jakob Andersson. Automatic invoice data extraction as a
constraint satisfaction problem, 2020.
[6] Halil Arslan. End to end invoice processing application
based on key fields extraction. IEEE Access, 10:78398–
78413, 2022.
[7] Henry S Baird. Background structure in document images.
International Journal of Pattern Recognition and Artificial
Intelligence, 8(05):1013–1030, 1994.
[8] Souhail Bakkali, Zuheng Ming, Mickaël Coustaty, Marçal
Rusiñol, and Oriol Ramos Terrades. Vlcdoc: Vision-
language contrastive pre-training model for cross-modal
document classification. Pattern Recognit., 139:109419,
2023.
[9] Chiara Bardelli, Alessandro Rondinelli, Ruggero Vecchio,
and Silvia Figini. Automatic electronic invoice classification
using machine learning models. Mach. Learn. Knowl. Extr.,
2(4):617–629, 2020.
[10] Pulkit Batra, Nimish Phalnikar, Deepesh Kurmi, Jitendra
Tembhurne, Parul Sahare, and Tausif Diwan. Ocr-mrd: Per-
formance analysis of different optical character recognition
engines for medical report digitization. 2023.
[11] Dipali Baviskar, Swati Ahirrao, and Ketan Kotecha. Multi-
layout invoice document dataset (MIDD): A dataset for
named entity recognition. Data, 6(7):78, 2021.
[12] Dipali Baviskar, Swati Ahirrao, Vidyasagar M. Potdar,
and Ketan Kotecha. Efficient automated processing of
the unstructured documents using artificial intelligence: A
systematic literature review and future directions. IEEE
Access, 9:72894–72936, 2021.
[13] Abdel Belaïd, Vincent Poulain D’Andecy, Hatem Hamza,
and Yolande Belaïd. Administrative document analysis
and structure. In Marenglen B. and Fatos X., editors,
Learning Structure and Schemas from Documents, Studies
in Computational Intelligence. Springer, 2011.
[14] Claude Berge. Isomorphism problems for hypergraphs. In
Hypergraph Seminar, pages 1–12. Springer, 1974.
[15] Claude Berge. Graphs and Hypergraphs. Elsevier Science
Ltd, 1985.
[16] Showmik Bhowmik, Ram Sarkar, Mita Nasipuri, and
David S. Doermann. Text and non-text separation in
offline document images: a survey. Int. J. Document Anal.
Recognit., 21(1-2):1–20, 2018.
[17] Carlo Biagioli, Enrico Francesconi, Andrea Passerini, Si-
monetta Montemagni, and Claudia Soria. Automatic se-
mantics extraction in law documents. In Proceedings of the
10th international conference on Artificial intelligence and
law, pages 133–140, 2005.
[18] Thomas M Breuel. The ocropus open source ocr system. In
Document recognition and retrieval XV, volume 6815, page
68150F. International Society for Optics and Photonics,
2008.
[19] Théo Le Calvar, Fabien Chhel, Frédéric Jouault, and
Frédéric Saubion. Toward a declarative language to generate
explorable sets of models. In Proceedings of the 34th
ACM/SIGAPP Symposium on Applied Computing, pages
1837–1844, 2019.
[20] Helena Ceovic, Adrian Satja Kurdija, Goran Delac, and
Marin Silic. Named entity recognition for addresses: An
empirical study. IEEE Access, 10:42094–42106, 2022.
[21] Francesca Cesarini, Enrico Francesconi, Marco Gori, Simone
Marinai, JQ Sheng, and Giovanni Soda. Rectangle labelling
for an invoice understanding system. In Proceedings of the
Fourth ICDAR. IEEE, 1997.
[22] Francesca Cesarini, Enrico Francesconi, Simone Marinai,
Jianqing Sheng, and Giovanni Soda. Conceptual modelling
for invoice document processing. In Roland R. Wagner,
editor, Eighth International Workshop on Database and
Expert Systems Applications, DEXA ’97, Toulouse, France,
September 1-2, 1997, Proceedings, pages 596–603. IEEE
Computer Society, 1997.
[23] Francesca Cesarini, Simone Marinai, L. Sarti, and Giovanni
Soda. Trainable table location in document images. In 16th
International Conference on Pattern Recognition, ICPR
2002, Quebec, Canada, August 11-15, 2002, pages 236–240.
IEEE Computer Society, 2002.
[24] Parinya Chalermsook and Julia Chuzhoy. Maximum inde-
pendent set of rectangles. In Proceedings of the twenti-
eth annual ACM-SIAM symposium on discrete algorithms,
pages 892–901. SIAM, 2009.
[25] Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanx-
uan Yin, and Xian-Ling Mao. Complicated table structure
recognition. arXiv preprint arXiv:1908.04729, 2019.
[26] Laura Chiticariu, Yunyao Li, and Frederick Reiss. Rule-
based information extraction is dead! long live rule-based
information extraction systems! In Proceedings of the
2013 conference on empirical methods in natural language
processing, pages 827–832, 2013.
[27] Jason PC Chiu and Eric Nichols. Named entity recognition
with bidirectional lstm-cnns. Transactions of the Associa-
tion for Computational Linguistics, 4:357–370, 2016.
[28] Amit Choudhary, Rahul Rishi, and Savita Ahlawat. Un-
constrained handwritten digit ocr using projection profile
and neural network approach. In Proceedings of the In-
ternational Conference on Information Systems Design and
Intelligent Applications 2012 (INDIA 2012) held in Visakha-
patnam, India, January 2012, pages 119–126. Springer, 2012.
[29] Stephen A Cook. The complexity of theorem-proving proce-
dures. In Proceedings of the third annual ACM symposium
on Theory of computing, pages 151–158, 1971.
[30] Andreiwid Sheffer Corrêa and Pär-Ola Zander. Unleashing
tabular content to open data: A survey on PDF table
extraction methods and tools. In Charles C. Hinnant and
Adegboyega Ojo, editors, Proceedings of the 18th Annual
International Conference on Digital Government Research,
DG.O 2017, Staten Island, NY, USA, June 7-9, 2017, pages
54–63. ACM, 2017.
[31] Vincent Poulain D’Andecy, Emmanuel Hartmann, and
Marçal Rusiñol. Field extraction by hybrid incremental and
a-priori structural templates. In 13th IAPR, DAS 2018,
pages 251–256. IEEE Computer Society, 2018.
[32] Brian L. Davis, Bryan S. Morse, Brian L. Price, Chris
Tensmeyer, Curtis Wigington, and Vlad I. Morariu. End-to-
end document recognition and understanding with dessurt.
In Leonid Karlinsky, Tomer Michaeli, and Ko Nishino,
editors, Computer Vision - ECCV 2022 Workshops - Tel
Aviv, Israel, October 23-27, 2022, Proceedings, Part IV,
volume 13804 of Lecture Notes in Computer Science, pages
280–296. Springer, 2022.
[33] Andreas Dengel and Bertin Klein. smartfix: A requirements-
driven system for document analysis and understanding. In
Daniel P. Lopresti, Jianying Hu, and Ramanujan S. Kashi,
editors, 5th DAS, volume 2423 of Lecture Notes in Computer
Science, pages 433–444. Springer, 2002.
[34] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina
Toutanova. Bert: Pre-training of deep bidirectional trans-
formers for language understanding. In Jill Burstein, Christy
Doran, and Thamar Solorio, editors, Proceedings of the
2019 Conference of the North American Chapter of the As-
sociation for Computational Linguistics: Human Language
Technologies, NAACL-HLT 2019, Minneapolis, MN, USA,
June 2-7, 2019, Volume 1 (Long and Short Papers), pages
4171–4186. Association for Computational Linguistics, 2019.
12 VOLUME 10, 2022
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3360528
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Saout et al.: An Overview of Data Extraction from Invoices
[35] George R. Doddington, Alexis Mitchell, Mark A. Przybocki,
Lance A. Ramshaw, Stephanie M. Strassel, and Ralph M.
Weischedel. The automatic content extraction (ACE) pro-
gram - tasks, data, and evaluation. In Proceedings of the
Fourth International Conference on Language Resources
and Evaluation, LREC 2004, May 26-28, 2004, Lisbon,
Portugal. European Language Resources Association, 2004.
[36] Cihan Dogan, Aimore Dutra, Adam Gara, Alfredo Gemma,
Lei Shi, Michael Sigamani, and Ella Walters. Fine-grained
named entity recognition using elmo and wikidata. arXiv
preprint arXiv:1904.10503, 2019.
[37] Line Eikvil. Optical character recognition. citeseer.ist.psu.edu/142042.html, 26, 1993.
[38] David W. Embley, Matthew Hurst, Daniel P. Lopresti,
and George Nagy. Table-processing paradigms: a research
survey. Int. J. Document Anal. Recognit., 8(2-3):66–86,
2006.
[39] Daniel Esser, Daniel Schuster, Klemens Muthmann, Michael
Berger, and Alexander Schill. Automatic indexing of
scanned documents: a layout-based approach. In Document
recognition and retrieval XIX, volume 8297, page 82970H.
International Society for Optics and Photonics, 2012.
[40] Richard Evans and Stafford Street. A framework for named
entity recognition in the open domain. Recent advances
in natural language processing III: selected papers from
RANLP, 260(267-274):110, 2003.
[41] Jing Fang, Liangcai Gao, Kun Bai, Ruiheng Qiu, Xin Tao,
and Zhi Tang. A table detection method for multipage PDF
documents via visual separators and tabular structures. In
2011 International Conference on Document Analysis and
Recognition, ICDAR 2011, Beijing, China, September 18-
21, 2011, pages 779–783. IEEE Computer Society, 2011.
[42] Enrico Francesconi, Simonetta Montemagni, Wim Peters,
and Daniela Tiscornia. Semantic processing of legal texts:
Where the language of law meets the law of language,
volume 6036. Springer, 2010.
[43] Sumam Francis, Jordy Van Landeghem, and Marie-Francine
Moens. Transfer learning for named entity recognition in
financial and biomedical documents. Information, 10(8):248,
2019.
[44] Ayushe Gangal, Peeyush Kumar, and Sunita Kumari.
Complete scanning application using opencv. CoRR,
abs/2107.03700, 2021.
[45] Liangcai Gao, Yilun Huang, Hervé Déjean, Jean-Luc Meu-
nier, Qinqin Yan, Yu Fang, Florian Kleber, and Eva Maria
Lang. ICDAR 2019 competition on table detection and
recognition (ctdar). In 2019 ICDAR, pages 1510–1515.
IEEE, 2019.
[46] Andrea Gemelli, Emanuele Vivoli, and Simone Marinai.
Graph neural networks and representation embedding for
table extraction in PDF documents. In 26th International
Conference on Pattern Recognition, ICPR 2022, Montreal,
QC, Canada, August 21-25, 2022, pages 1719–1726. IEEE,
2022.
[47] Azka Gilani, Shah Rukh Qasim, Muhammad Imran Malik,
and Faisal Shafait. Table detection using deep learning. In
14th IAPR, ICDAR 2017, pages 771–776. IEEE, 2017.
[48] Max C. Göbel, Tamir Hassan, Ermelinda Oro, Giorgio
Orsi, and Roya Rastan. Table modelling, extraction and
processing. In Robert Sablatnig and Tamir Hassan, editors,
Proceedings of the 2016 ACM Symposium on Document
Engineering, DocEng 2016, Vienna, Austria, September 13
- 16, 2016, pages 1–2. ACM, 2016.
[49] Venu Govindaraju, Prem Natarajan, Santanu Chaudhury,
and Daniel P. Lopresti, editors. Proceedings of the Inter-
national Workshop on Multilingual OCR, MOCR@ICDAR
2009, Barcelona, Spain, July 25, 2009. ACM, 2009.
[50] Hien Thi Ha and Ales Horák. Information extraction
from scanned invoice images using text analysis and layout
features. Signal Process. Image Commun., 102:116601, 2022.
[51] Kailash A Hambarde and Hugo Proenca. Information
retrieval: Recent advances and beyond. arXiv preprint
arXiv:2301.08801, 2023.
[52] Ahmed Hamdi, Elodie Carel, Aurélie Joseph, Mickaël Cous-
taty, and Antoine Doucet. Information extraction from
invoices. In Josep Lladós, Daniel Lopresti, and Seiichi
Uchida, editors, ICDAR 2021, Proceedings, Part II, volume
12822 of Lecture Notes in Computer Science, pages 699–714.
Springer, 2021.
[53] Adam W. Harley, Alex Ufkes, and Konstantinos G. Derpa-
nis. Evaluation of deep convolutional nets for document
image classification and retrieval. In 13th International
Conference on Document Analysis and Recognition, ICDAR
2015, Nancy, France, August 23-26, 2015, pages 991–995.
IEEE Computer Society, 2015.
[54] Changai He, Sibao Chen, Shilei Huang, Jian Zhang, and
Xiao Song. Using convolutional neural network with bert
for intent determination. In 2019 International Conference
on Asian Language Processing (IALP), pages 65–70. IEEE,
2019.
[55] Kaijian He, Qian Yang, Lei Ji, Jingcheng Pan, and Yingchao
Zou. Financial time series forecasting with the deep learning
ensemble model. Mathematics, 11(4), 2023.
[56] David Hollingsworth. The workflow reference model, 1994.
[57] Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu
Wei. Layoutlmv3: Pre-training for document AI with unified
text and image masking. In João Magalhães, Alberto Del
Bimbo, Shin’ichi Satoh, Nicu Sebe, Xavier Alameda-Pineda,
Qin Jin, Vincent Oria, and Laura Toni, editors, MM ’22: The
30th ACM International Conference on Multimedia, Lisboa,
Portugal, October 10 - 14, 2022, pages 4083–4091. ACM,
2022.
[58] Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimos-
thenis Karatzas, Shijian Lu, and CV Jawahar. Icdar2019
competition on scanned receipt ocr and information extrac-
tion. In 2019 ICDAR, pages 1516–1520. IEEE, 2019.
[59] Noman Islam, Zeeshan Islam, and Nazia Noor. A survey on
optical character recognition system. Journal of Information
and Communication Technology (JICT), 2017.
[60] Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe
Thiran. FUNSD: A dataset for form understanding in
noisy scanned documents. In 2nd International Work-
shop on Open Services and Tools for Document Analysis,
OST@ICDAR 2019, Sydney, Australia, September 22-25,
2019, pages 1–6. IEEE, 2019.
[61] Piyushee Jha and George Nagy. Wang notation tool: Layout
independent representation of tables. In ICPR 2008, pages
1–4. IEEE Computer Society, 2008.
[62] Yuxiang Jiang, Haiwei Dong, and Abdulmotaleb El-Saddik.
Baidu meizu deep learning competition: Arithmetic opera-
tion recognition using end-to-end learning OCR technolo-
gies. IEEE Access, 6:60128–60136, 2018.
[63] Ralph H. Sprague Jr. Electronic document management:
Challenges and opportunities for information systems man-
agers. MIS Q., 19(1):29–49, 1995.
[64] Mahmoud Kasem, Abdelrahman Abdallah, Alexander
Berendeyev, Ebrahem Elkady, Mahmoud Abdalla, Mo-
hamed Mahmoud, Mohamed Hamada, Daniyar Nurseitov,
and Islam Taj-Eddin. Deep learning for table detection and
structure recognition: A survey, 2022.
[65] Tejas Kashinath, Twisha Jain, Yash Agrawal, Tanvi Anand,
and Sanjay Singh. End-to-end table structure recognition
and extraction in heterogeneous documents. Appl. Soft
Comput., 123:108942, 2022.
[66] Sukhandeep Kaur, Seema Bawa, and Ravinder Kumar. A
survey of mono- and multi-lingual character recognition
using deep and shallow architectures: indic and non-indic
scripts. Artif. Intell. Rev., 53(3):1813–1872, 2020.
[67] I. Kavasidis, C. Pino, S. Palazzo, F. Rundo, D. Giordano,
P. Messina, and C. Spampinato. A saliency-based convo-
lutional neural network for table and chart detection in
digitized documents. In Elisa Ricci, Samuel Rota Bulò, Cees
Snoek, Oswald Lanz, Stefano Messelodi, and Nicu Sebe,
editors, Image Analysis and Processing - ICIAP 2019, pages
292–302, Cham, 2019. Springer International Publishing.
[68] Saqib Ali Khan, Syed Muhammad Daniyal Khalid, Muham-
mad Ali Shahzad, and Faisal Shafait. Table structure
extraction with bi-directional gated recurrent unit networks.
In 2019 ICDAR, pages 1366–1371. IEEE, 2019.
[69] Thomas Kieninger and Andreas Dengel. The t-recs table
recognition and analysis system. In Seong-Whan Lee and
Yasuaki Nakano, editors, Document Analysis Systems: The-
ory and Practice, Third IAPR Workshop, DAS’98, Nagano,
Japan, November 4-6, 1998, Selected Papers, volume 1655
of Lecture Notes in Computer Science, pages 255–269.
Springer, 1998.
[70] Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon
Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sang-
doo Yun, Dongyoon Han, and Seunghyun Park. Ocr-
free document understanding transformer. In Shai Avi-
dan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria
Farinella, and Tal Hassner, editors, Computer Vision -
ECCV 2022 - 17th European Conference, Tel Aviv, Israel,
October 23-27, 2022, Proceedings, Part XXVIII, volume
13688 of Lecture Notes in Computer Science, pages 498–
517. Springer, 2022.
[71] Mario Köppen, Dörte Waldöstl, and Bertram Nickolay.
A system for the automated evaluation of invoices. In
Jonathan J. Hull and Suzanne Liebowitz Taylor, editors,
DAS 1996, volume 29 of Series in Machine Perception and
Artificial Intelligence, pages 223–241. World Scientific, 1996.
[72] Sargur N. Srihari and Stephen W. Lam. Character recognition.
IETE Journal of Education, 17(3):154–156, 1976.
[73] Guillaume Lample, Miguel Ballesteros, Sandeep Subrama-
nian, Kazuya Kawakami, and Chris Dyer. Neural archi-
tectures for named entity recognition. In Kevin Knight,
Ani Nenkova, and Owen Rambow, editors, NAACL HLT
2016, The 2016 Conference of the North American Chapter
of the Association for Computational Linguistics: Human
Language Technologies, San Diego California, USA, June
12-17, 2016, pages 260–270. The Association for Computa-
tional Linguistics, 2016.
[74] Eunji Lee, Jaewoo Park, Hyung Il Koo, and Nam Ik Cho.
Deep-learning and graph-based approach to table structure
recognition. Multim. Tools Appl., 81(4):5827–5848, 2022.
[75] Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. A survey
on deep learning for named entity recognition. IEEE Trans.
Knowl. Data Eng., 34(1):50–70, 2022.
[76] Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou,
and Zhoujun Li. Tablebank: Table benchmark for image-
based table detection and recognition. In Proceedings of
The 12th Language Resources and Evaluation Conference,
LREC 2020, Marseille, France, May 11-16, 2020, pages 1918–
1925. European Language Resources Association, 2020.
[77] Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan
Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei.
Trocr: Transformer-based optical character recognition with
pre-trained models. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 37, pages 13094–13102,
2023.
[78] Yiren Li, Zheng Huang, Junchi Yan, Yi Zhou, Fan Ye, and
Xianhui Liu. Gfte: graph-based financial table extraction.
In International Conference on Pattern Recognition, pages
644–658. Springer, 2021.
[79] Zewen Li, Fan Liu, Wenjie Yang, Shouheng Peng, and Jun
Zhou. A survey of convolutional neural networks: analysis,
applications, and prospects. IEEE transactions on neural
networks and learning systems, 2021.
[80] Ying Liu, Kun Bai, Prasenjit Mitra, and C Lee Giles. Table-
seer: automatic table metadata extraction and searching in
digital libraries. In Proceedings of the 7th ACM/IEEE-CS
joint conference on Digital libraries, pages 91–100, 2007.
[81] Will Lovegrove. Advanced document analysis and automatic
classification of PDF documents. PhD thesis, University of
Nottingham, UK, 1996.
[82] Nam Tuan Ly, Atsuhiro Takasu, Phuc Nguyen, and
Hideaki Takeda. Rethinking image-based table recogni-
tion using weakly supervised methods. arXiv preprint
arXiv:2303.07641, 2023.
[83] John Mantas. An overview of character recognition method-
ologies. Pattern recognition, 19(6):425–430, 1986.
[84] Nicola Mariella and Andrea Simonetto. A quantum al-
gorithm for the sub-graph isomorphism problem. ACM
Transactions on Quantum Computing, 4(2):1–34, 2023.
[85] Elaine Marsh and Dennis Perzanowski. Muc-7 evaluation
of ie technology: Overview of results. In Seventh Message
Understanding Conference (MUC-7): Proceedings of a Con-
ference Held in Fairfax, Virginia, April 29-May 1, 1998, 1998.
[86] Dániel Marx and Michał Pilipczuk. Everything you always
wanted to know about the parameterized complexity of
subgraph isomorphism (but were afraid to ask). In 31st In-
ternational Symposium on Theoretical Aspects of Computer
Science, page 542, 2014.
[87] Ciaran McCreesh, Patrick Prosser, and James Trimble. The
glasgow subgraph solver: using constraint programming to
tackle hard subgraph isomorphism problem variants. In
International Conference on Graph Transformation, pages
316–324. Springer, 2020.
[88] Eric Medvet, Alberto Bartoli, and Giorgio Davanzo. A
probabilistic approach to printed document understanding.
Int. J. Document Anal. Recognit., 14(4):335–347, 2011.
[89] Jamshed Memon, Maira Sami, Rizwan Ahmed Khan, and
Mueen Uddin. Handwritten optical character recogni-
tion (OCR): A comprehensive systematic literature review
(SLR). IEEE Access, 8:142642–142668, 2020.
[90] Tomás Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean.
Efficient estimation of word representations in vector space.
In Yoshua Bengio and Yann LeCun, editors, 1st Interna-
tional Conference on Learning Representations, ICLR 2013,
Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track
Proceedings, 2013.
[91] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado,
and Jeff Dean. Distributed representations of words and
phrases and their compositionality. In Advances in neural
information processing systems, pages 3111–3119, 2013.
[92] Ravina Mithe, Supriya Indalkar, and Nilam Divekar. Op-
tical character recognition. International journal of recent
technology and engineering (IJRTE), 2(1):72–75, 2013.
[93] Shunji Mori, Ching Y Suen, and Kazuhiko Yamamoto. His-
torical review of ocr research and development. Proceedings
of the IEEE, 80(7):1029–1058, 1992.
[94] Adel Moussaoui. Geometric Constraint Solver. PhD thesis,
Ecole nationale Supérieure d'Informatique (ex INI), Algiers,
2016.
[95] David Nadeau and Satoshi Sekine. A survey of named entity
recognition and classification. Lingvisticae Investigationes,
30(1):3–26, 2007.
[96] Tayyab Nasir, Muhammad Kamran Malik, and Khurram
Shahzad. MMU-OCR-21: towards end-to-end urdu text
recognition using deep learning. IEEE Access, 9:124945–
124962, 2021.
[97] Michael Netter, Eduardo B. Fernández, and Günther Pernul.
Refining the pattern-based reference model for electronic
invoices by incorporating threats. In ARES 2010, Fifth
International Conference on Availability, Reliability and
Security, 15-18 February 2010, Krakow, Poland, pages 560–
564. IEEE Computer Society, 2010.
[98] Clemens Neudecker, Konstantin Baierer, Mike Gerber,
Christian Clausner, Apostolos Antonacopoulos, and Stefan
Pletschacher. A survey of OCR evaluation tools and metrics.
In Apostolos Antonacopoulos, Christian Clausner, Maud
Ehrmann, Clemens Neudecker, and Stefan Pletschacher,
editors, HIP@ICDAR 2021: The 6th International Workshop
on Historical Document Imaging and Processing, Lausanne,
Switzerland, September 5-6, 2021, pages 13–18. ACM, 2021.
[99] Kyosuke Nishida, Kugatsu Sadamitsu, Ryuichiro Hi-
gashinaka, and Yoshihiro Matsuo. Understanding the se-
mantic structures of tables with a hybrid deep neural
network architecture. In Thirty-First AAAI Conference on
Artificial Intelligence, 2017.
[100] Shubham Singh Paliwal, Vishwanath D, Rohit Rahul,
Monika Sharma, and Lovekesh Vig. Tablenet: Deep learning
model for end-to-end table detection and tabular data
extraction from scanned document images. In 2019 Interna-
tional Conference on Document Analysis and Recognition,
ICDAR 2019, Sydney, Australia, September 20-25, 2019,
pages 128–133. IEEE, 2019.
[101] Seunghyun Park, Seung Shin, Bado Lee, Junyeop Lee,
Jaeheung Surh, Minjoon Seo, and Hwalsuk Lee. Cord: A
consolidated receipt dataset for post-ocr parsing. 2019.
[102] Shreeshiv Patel and Dvijesh Bhatt. Abstractive information
extraction from scanned invoices (AIESI) using end-to-end
sequential approach. CoRR, abs/2009.05728, 2020.
[103] Shreeshiv Patel and Dvijesh Bhatt. Abstractive information
extraction from scanned invoices (aiesi) using end-to-end
sequential approach. arXiv preprint arXiv:2009.05728, 2020.
[104] Martha O Perez-Arriaga, Trilce Estrada, and Soraya Abad-
Mota. Tao: system for table detection and extraction from
pdf documents. In The Twenty-Ninth International Flairs
Conference, 2016.
[105] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt
Gardner, Christopher Clark, Kenton Lee, and Luke Zettle-
moyer. Deep contextualized word representations. In
Marilyn A. Walker, Heng Ji, and Amanda Stent, editors,
Proceedings of the 2018 Conference of the North American
Chapter of the Association for Computational Linguistics:
Human Language Technologies, NAACL-HLT 2018, New
Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long
Papers), pages 2227–2237. Association for Computational
Linguistics, 2018.
[106] David Pinto, Andrew McCallum, Xing Wei, and W Bruce
Croft. Table extraction using conditional random fields. In
Proceedings of the 26th annual international ACM SIGIR
conference on Research and development in informaion
retrieval, pages 235–242, 2003.
[107] Gorjan Popovski, Barbara Korousic-Seljak, and Tome Efti-
mov. A survey of named-entity recognition methods for food
information extraction. IEEE Access, 8:31586–31594, 2020.
[108] Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish
Visave, and Kavita Sultanpure. Cascadetabnet: An ap-
proach for end to end table detection and structure recog-
nition from image-based documents. In 2020 IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
CVPR Workshops 2020, Seattle, WA, USA, June 14-19,
2020, pages 2439–2447. Computer Vision Foundation /
IEEE, 2020.
[109] Shah Rukh Qasim, Hassan Mahmood, and Faisal Shafait.
Rethinking table recognition using graph neural networks.
In 2019 International Conference on Document Analysis and
Recognition, ICDAR 2019, Sydney, Australia, September
20-25, 2019, pages 142–147. IEEE, 2019.
[110] Liang Qiao, Zaisheng Li, Zhanzhan Cheng, Peng Zhang,
Shiliang Pu, Yi Niu, Wenqi Ren, Wenming Tan, and Fei Wu.
LGPMA: complicated table structure recognition with local
and global pyramid mask alignment. In Josep Lladós, Daniel
Lopresti, and Seiichi Uchida, editors, 16th International
Conference on Document Analysis and Recognition, ICDAR
2021, Lausanne, Switzerland, September 5-10, 2021, Pro-
ceedings, Part I, volume 12821 of Lecture Notes in Computer
Science, pages 99–114. Springer, 2021.
[111] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya
Sutskever, et al. Improving language understanding by
generative pre-training. 2018.
[112] Sachin Raja, Ajoy Mondal, and C. V. Jawahar. Table
structure recognition using top-down and bottom-up cues.
In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-
Michael Frahm, editors, Computer Vision - ECCV 2020 -
16th European Conference, Glasgow, UK, August 23-28,
2020, Proceedings, Part XXVIII, volume 12373 of Lecture
Notes in Computer Science, pages 70–86. Springer, 2020.
[113] Roya Rastan, Hye-Young Paik, John Shepherd, Seung Hwan
Ryu, and Amin Beheshti. TEXUS: table extraction sys-
tem for PDF documents. In Junhu Wang, Gao Cong,
Jinjun Chen, and Jianzhong Qi, editors, Databases Theory
and Applications - 29th Australasian Database Conference,
ADC 2018, Gold Coast, QLD, Australia, May 24-27, 2018,
Proceedings, volume 10837 of Lecture Notes in Computer
Science, pages 345–349. Springer, 2018.
[114] Pau Riba, Anjan Dutta, Lutz Goldmann, Alicia Fornés,
Oriol Ramos Terrades, and Josep Lladós. Table detection
in invoice documents by graph neural networks. In 2019
ICDAR. IEEE, 2019.
[115] Mariana Rodrigues Makiuchi, Tifani Warnita, Kuniaki Uto,
and Koichi Shinoda. Multimodal fusion of bert-cnn and
gated cnn representations for depression detection. In Pro-
ceedings of the 9th International on Audio/Visual Emotion
Challenge and Workshop, pages 55–63, 2019.
[116] Ali Safaya, Moutasem Abdullatif, and Deniz Yuret. Kuisail
at semeval-2020 task 12: Bert-cnn for offensive speech iden-
tification in social media. In Proceedings of the Fourteenth
Workshop on Semantic Evaluation, pages 2054–2059, 2020.
[117] KC Santosh and Abdel Belaïd. Pattern-based approach
to table extraction. In Iberian Conference on Pattern
Recognition and Image Analysis, pages 766–773. Springer,
2013.
[118] Thomas Saout, Frédéric Lardeux, and Frédéric Saubion. A
two-stage approach for table extraction in invoices. CoRR,
abs/2210.04716, 2022.
[119] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus
Hagenbuchner, and Gabriele Monfardini. The graph neural
network model. IEEE Trans. Neural Networks, 20(1):61–80,
2009.
[120] Sebastian Schreiber, Stefan Agne, Ivo Wolf, Andreas Dengel,
and Sheraz Ahmed. Deepdesrt: Deep learning for detection
and structure recognition of tables in document images. In
14th IAPR International Conference on Document Analysis
and Recognition, ICDAR 2017, Kyoto, Japan, November 9-
15, 2017, pages 1162–1167. IEEE, 2017.
[121] Sharad C. Seth and George Nagy. Segmenting tables via
indexing of value cells by table headers. In 12th ICDAR
2013, pages 887–891. IEEE Computer Society, 2013.
[122] Faisal Shafait and Ray Smith. Table detection in heteroge-
neous documents. In David S. D., Venu G., Daniel P. L;,
and Premkumar N., editors, The Ninth IAPR, DAS 2010.
ACM, 2010.
[123] Andrey Shapenko, Vladimir Korovkin, and Benoit Leleux.
Abbyy: the digitization of language and text. Emerald
Emerging Markets Case Studies, 8:1–26, April 2018.
[124] Alexey O. Shigarov, Andrey Altaev, Andrey A. Mikhailov,
Viacheslav Paramonov, and Evgeniy A. Cherkashin. Tab-
bypdf: Web-based system for PDF table extraction. In
Robertas D. and Giedre Vasiljeviene, editors, ICIST 2018,
Proceedings, Communications in Computer and Informa-
tion Science, 2018.
[125] Shoaib Ahmed Siddiqui, Imran Ali Fateh, Syed Tah-
seen Raza Rizvi, Andreas Dengel, and Sheraz Ahmed.
Deeptabstr: Deep learning based table structure recognition.
In 2019 International Conference on Document Analysis and
Recognition, ICDAR 2019, Sydney, Australia, September
20-25, 2019, pages 1403–1409. IEEE, 2019.
[126] Scharolta Katharina Sienčnik. Adapting word2vec to named
entity recognition. In Proceedings of the 20th Nordic Con-
ference of Computational Linguistics (NODALIDA 2015),
pages 239–243, 2015.
[127] Ray Smith. An overview of the tesseract OCR engine. In
9th ICDAR, pages 629–633. IEEE Computer Society, 2007.
[128] Christine Solnon. Experimental evaluation of subgraph
isomorphism solvers. In International Workshop on Graph-
Based Representations in Pattern Recognition, pages 1–13.
Springer, 2019.
[129] Enrico Sorio, Alberto Bartoli, Giorgio Davanzo, and Eric
Medvet. Open world classification of printed invoices. In
Apostolos Antonacopoulos, Michael J. Gormish, and Rolf
Ingold, editors, Proceedings of the 2010 ACM Symposium
on Document Engineering, Manchester, United Kingdom,
September 21-24, 2010, pages 187–190. ACM, 2010.
[130] Pedro Javier Ortiz Suárez, Yoann Dupont, Benjamin Muller,
Laurent Romary, and Benoît Sagot. Establishing a new
state-of-the-art for french named entity recognition. In
LREC 2020-12th Language Resources and Evaluation Con-
ference, 2020.
[131] Yingyi Sun, Xianfeng Mao, Sheng Hong, Wenhua Xu, and
Guan Gui. Template matching-based method for intelligent
invoice information identification. IEEE access, 7:28392–
28401, 2019.
[132] Ahmad Tarawneh, Ahmad Hassanat, Dmitry Chetverikov,
Imre Lendak, and Chaman Verma. Invoice classification
using deep features and machine learning techniques, March 2019.
[133] Suzanne Liebowitz Taylor, Richard Fritzson, and Jon A
Pastor. Extraction of data from preprinted forms. Machine
Vision and Applications, 5(3):211–222, 1992.
[134] Chris Tensmeyer, Vlad I. Morariu, Brian L. Price, Scott
Cohen, and Tony R. Martinez. Deep splitting and merging
for table structure decomposition. In 2019 International
Conference on Document Analysis and Recognition, ICDAR
2019, Sydney, Australia, September 20-25, 2019, pages 114–
121. IEEE, 2019.
[135] Erik F Tjong Kim Sang and Fien De Meulder. Introduction
to the conll-2003 shared task: language-independent named
entity recognition. In Proceedings of the seventh conference
on Natural language learning at HLT-NAACL 2003-Volume
4, pages 142–147, 2003.
[136] Panos Vassiliadis and Alkis Simitsis. Extraction, transfor-
mation, and loading. Encyclopedia of Database Systems,
10, 2009.
[137] Nataliya Le Vine, Matthew Zeigenfuse, and Mark Rowan.
Extracting tables from documents using conditional gen-
erative adversarial networks and genetic algorithms. In
International Joint Conference on Neural Networks, IJCNN
2019 Budapest, Hungary, July 14-19, 2019, pages 1–8. IEEE,
2019.
[138] Joris Voerman, Aurélie Joseph, Mickaël Coustaty, Vin-
cent Poulain D’Andecy, and Jean-Marc Ogier. Evaluation of
neural network classification systems on document stream.
In Xiang Bai, Dimosthenis Karatzas, and Daniel Lopresti,
editors, Document Analysis Systems - 14th IAPR Inter-
national Workshop, DAS 2020, Wuhan, China, July 26-
29, 2020, Proceedings, volume 12116 of Lecture Notes in
Computer Science, pages 262–276. Springer, 2020.
[139] Qianwen Wang and Mizuho Iwaihara. Deep neural architec-
tures for joint named entity recognition and disambiguation.
In IEEE International Conference on Big Data and Smart
Computing, BigComp 2019, Kyoto, Japan, February 27 -
March 2, 2019, pages 1–4. IEEE, 2019.
[140] Yalin Wang, Ihsin T. Phillips, and Robert Haralick. Automatic table ground truth generation and a background-
analysis-based table structure extraction method. In Pro-
ceedings of Sixth International Conference on Document
Analysis and Recognition, pages 528–532. IEEE, 2001.
[141] Zhiwen Xiao, Haoxi Zhang, Huagang Tong, and Xin Xu. An
efficient temporal network with dual self-distillation for elec-
troencephalography signal classification. In 2022 IEEE In-
ternational Conference on Bioinformatics and Biomedicine
(BIBM), pages 1759–1762, 2022.
[142] Huanlai Xing, Zhiwen Xiao, Rong Qu, Zonghai Zhu, and
Bowen Zhao. An efficient federated distillation learning sys-
tem for multitask time series classification. IEEE Transac-
tions on Instrumentation and Measurement, 71:1–12, 2022.
[143] Huanlai Xing, Zhiwen Xiao, Dawei Zhan, Shouxi Luo,
Penglin Dai, and Ke Li. Selfmatch: Robust semisupervised
time-series classification with self-distillation. International
Journal of Intelligent Systems, 37(11):8583–8610, 2022.
[144] Wenyuan Xue, Baosheng Yu, Wen Wang, Dacheng Tao,
and Qingyong Li. Tgrnet: A table graph reconstruction
network for table structure recognition. In 2021 IEEE/CVF
International Conference on Computer Vision, ICCV 2021,
Montreal, QC, Canada, October 10-17, 2021, pages 1275–
1284. IEEE, 2021.
[145] Vikas Yadav and Steven Bethard. A survey on recent
advances in named entity recognition from deep learning
models. In Proceedings of the 27th International Conference
on Computational Linguistics, pages 2145–2158, 2018.
[146] Limin Yao, Sebastian Riedel, and Andrew McCallum. Col-
lective cross-document relation extraction without labelled
data. In Proceedings of the 2010 Conference on Empirical
Methods in Natural Language Processing, pages 1013–1023,
2010.
[147] Qixiang Ye and David S. Doermann. Text detection and
recognition in imagery: A survey. IEEE Trans. Pattern Anal.
Mach. Intell., 37(7):1480–1500, 2015.
[148] Zi Ye, Yogan Jaya Kumar, Goh Ong Sing, Fengyan Song,
and Junsong Wang. A comprehensive survey of graph neural
networks for knowledge graphs. IEEE Access, 10:75729–
75741, 2022.
[149] Burcu Yildiz, Katharina Kaiser, and Silvia Miksch.
pdf2table: A method to extract table information from pdf
files. In IICAI, pages 1773–1785, 2005.
[150] Richard Zanibbi, Dorothea Blostein, and James R. Cordy.
A survey of table recognition. Int. J. Document Anal.
Recognit., 2004.
[151] Liang Zhang and Huan Zhao. Named entity recognition
for chinese microblog with convolutional neural network. In
Yong Liu, Liang Zhao, Guoyong Cai, Guoqing Xiao, Kenli
Li, and Lipo Wang, editors, 13th International Conference
on Natural Computation, Fuzzy Systems and Knowledge
Discovery, ICNC-FSKD 2017, Guilin, China, July 29-31,
2017, pages 87–92. IEEE, 2017.
[152] Mengshi Zhang, Daniel Perelman, Vu Le, and Sumit Gul-
wani. An integrated approach of deep learning and symbolic
analysis for digital PDF table extraction. In 25th Inter-
national Conference on Pattern Recognition, ICPR 2020,
Virtual Event / Milan, Italy, January 10-15, 2021, pages
4062–4069. IEEE, 2020.
[153] Peng Zhang, Can Li, Liang Qiao, Zhanzhan Cheng, Shiliang
Pu, Yi Niu, and Fei Wu. VSR: A unified framework for
document layout analysis combining vision, semantics and
relations. In Josep Lladós, Daniel Lopresti, and Seiichi
Uchida, editors, 16th International Conference on Document
Analysis and Recognition, ICDAR 2021, Lausanne, Switzer-
land, September 5-10, 2021, Proceedings, Part I, volume
12821 of Lecture Notes in Computer Science, pages 115–
130. Springer, 2021.
[154] Kaihong Zheng, Lingyun Sun, Xin Wang, Shangli Zhou,
Hanbin Li, Sheng Li, Lukun Zeng, and Qihang Gong. Named
entity recognition in electric power metering domain based
on attention mechanism. IEEE Access, 9:152564–152573,
2021.
[155] Xinyi Zheng, Douglas Burdick, Lucian Popa, Xu Zhong,
and Nancy Xin Ru Wang. Global table extractor (GTE):
A framework for joint table identification and cell struc-
ture recognition using visual context. In IEEE Winter
Conference on Applications of Computer Vision, WACV
2021, Waikoloa, HI, USA, January 3-8, 2021, pages 697–706.
IEEE, 2021.
[156] Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno-
Yepes. Image-based table recognition: Data, model, and
evaluation. In Andrea Vedaldi, Horst Bischof, Thomas Brox,
and Jan-Michael Frahm, editors, Computer Vision - ECCV
2020 - 16th European Conference, Glasgow, UK, August 23-
28, 2020, Proceedings, Part XXI, volume 12366 of Lecture
Notes in Computer Science, pages 564–580. Springer, 2020.
[157] Xu Zhong, Elaheh ShafieiBavani, and Antonio Ji-
meno Yepes. Image-based table recognition: data, model,
and evaluation. In European conference on computer vision,
pages 564–580. Springer, 2020.
[158] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. Pub-
laynet: largest dataset ever for document layout analysis. In
2019 (ICDAR), pages 1015–1022. IEEE, 2019.
16 VOLUME 10, 2022
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3360528
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Saout et al.: An Overview of Data Extraction from Invoices
THOMAS SAOUT was born in Brest, France, in 1992. In 2020, he earned an MS degree in Computer Science, specializing in Decision Intelligence, from the University of Angers. After working for six months as a Java developer at KS2, a French company that produces ERP solutions, he began his PhD at the University of Angers with the LERIA in 2021. His research focuses on Evolutionary Algorithms, Information Retrieval, Natural Language Processing, and Graph Pattern Recognition.
FRÉDÉRIC LARDEUX was born in France in 1979. He received the MS and PhD degrees in computer science from the University of Angers, France, in 2002 and 2005, respectively. Since 2006, he has been a professor with the LERIA, University of Angers, France. His research interests include Constraints (CSP, SAT), Model Transformations, Combinatorial Optimization, Metaheuristics, Evolutionary Computation, Learning (Reinforcement Learning, Machine Learning), and Logical Analysis of Data.
FRÉDÉRIC SAUBION received his MS and PhD degrees in computer science from the University of Orléans, France, in 1996. From 1997 to 2003, he was an assistant professor at the University of Angers, France. He has been a full professor with the Faculty of Science at the University of Angers since 2004. His research interests include Metaheuristics, Evolutionary Computation, and Machine Learning. He has supervised a dozen PhD students. He has contributed to the autonomous search paradigm, which consists in improving the automated setting and control of solving algorithms, in particular through machine learning techniques. He has also investigated various application domains (biology, information retrieval, etc.).