Mining Text from Natural Scene and Video Images - A Survey
Palaiahnakote Shivakumara1, Alireza Alaei2, Umapada Pal3, *
1Faculty of Computer Science and Information Technology, University of Malaya, Malaysia.
Email: shiva@um.edu.my
2Faculty of Science and Engineering, Southern Cross University, Australia
Email: ali.alaei@scu.edu.au
3Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata, India
Email: umapada@isical.ac.in
How to cite this article: Shivakumara, P., Alaei, A., & Pal, U. (2021). Mining text from natural scene and
video images: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, e1428.
https://doi.org/10.1002/widm.1428
Abstract
In computer terminology, mining is considered as extracting meaningful information or knowledge from
a large amount of data/information using computers. The meaningful information can be extracted from
normal text, and images obtained from different resources, such as natural scene images, video and
documents by deriving semantics from text and content of the images. Although there are many pieces
of work on text/data mining and several survey/review papers are published in the literature, to the best
of our knowledge there is no survey paper on mining textual information from the natural scene, video
and document images considering word spotting techniques. In this paper, we, therefore, provide a
comprehensive review of both the non-spotting and spotting based mining techniques. The mining ap-
proaches are categorized as feature, learning and hybrid-based methods to analyze the strengths and
limitations of the models of each category. In addition, it also discusses the usefulness of the methods
according to different situations and applications. Furthermore, based on the review of different mining
approaches, this paper identifies the limitations of the existing methods and suggests new applications
and future directions to continue the research in multiple dimensions. We believe such a review article
will be useful to the researchers to quickly get the state-of-the-art information and the progress made
towards mining textual information from natural scene and video images.
Keywords: Document images, Keyword spotting, Natural scene images, Video images, Text mining.
1 INTRODUCTION
Text mining involves automatically discovering/extracting new and/or previously unknown vital and quality
information using machines. It is a multidisciplinary field of research, which incorporates and integrates
different tools and concepts from information retrieval, data mining, machine learning, statistics, and
computational linguistics. Text mining has several applications in many areas, including risk and
knowledge management, cybercrime prevention, content enrichment, and fraud detection. In addition,
it can also assist in mining important information from a large database, which contains heterogeneous
and diverse data, such as document, natural scene and video images. Document, natural scene and
video images can be processed to extract their content, layout and logical structures. This process can
further help to extract knowledge from images/videos at different levels of granularity, such as page,
text-line, word, and character. Automated extraction of knowledge from document images can also im-
prove document image analysis applications in different contexts, including document classification and
indexing, document reformatting, and document reconstruction.
Text detection and recognition, especially in the scene and video images, are active research topics in
the domain of document image analysis, in particular, and data/text mining, in general (Shi et al., 2014).
As the text in a natural scene or video image is the main source of semantic information and provides
rich information about the content of the image, there are several real-world text mining applications,
such as contextual advertising, business intelligence, and content enrichment. It has also been shown
in the literature that foreground information, including text and salient objects, draw the attention of
viewers (Judd et al., 2009; Alaei et al., 2015; Alaei et al., 2017). Moreover, text detection followed by
recognition is an essential part of several computer vision applications, such as automatic sign reading,
language translation, autonomous car driving, and multimedia retrieval. As an example, an intelligent
transportation system can significantly transform the traffic and travel experience of people. Driver as-
sistance systems and autonomous cars are crucial parts of such a system and improve the safety and
security of passengers (Bagi et al., 2020).
There are several methods for text mining from images in large datasets in the literature (Zhang
et al. 2021; Lee & Wang, 2012; Jung & Lee, 2020). However, a review of the literature revealed that
most of the methods focus on annotations extracted by content-based image retrieval approaches.
From the literature, it is also noted that these methods are not robust for mining images from a large
diverse dataset because of the gap between low-level and high-level features, which do not match with
the actual meaning of the content of the image. This is the main gap between the current methods and
mining applications related to the text in natural scene, video and document images. This has motivated
many researchers to propose text-based approaches for mining images from datasets. As a result,
various methods called spotting techniques are proposed based on text in the images, videos and doc-
uments to extract the exact meaning of the text content instead of annotations derived from low-level
features to bridge this gap. However, most of the research work in the literature aims at extracting
particular information from a specific domain, such as extracting information from only images, only
videos or only documents but not all three together. Due to the popularity of social media, advanced
internet technologies and variations in digitization technologies for capturing data, one can expect di-
verse and heterogeneous datasets, which may include text, images, videos and documents in different
formats. To understand the advances in these domains, we have gathered a list of the most relevant
and recent papers and written a survey on mining text from natural scene, video and document images
under a single platform. This can further allow other researchers to develop new solutions to the chal-
lenges and problems of this research domain.
The rest of the paper is organized as follows. Section 2 highlights the needs and motivation for text
mining in natural scene and video images. Section 3 provides a brief idea of the way the papers are
collected and the survey is conducted. Section 4 deals with non-spotting based mining approaches
where feature based, learning based, and the combination of both are discussed for both natural scene
and video images. Section 5 deals with spotting based mining approaches. Future directions of research
are further discussed in Section 6. Finally, conclusions are drawn in Section 7.
2 MOTIVATION FOR TEXT MINING IN NATURAL SCENE AND VIDEO IMAGES
There are several methods developed for text detection, recognition and keyword spotting from the
document, natural scene and video images. Keyword spotting, as one of the document image analysis
techniques, includes a systematic methodology and framework to facilitate this transformation and help
to create computable knowledge. This technique searches for a known vocabulary in a document image
dataset and maps them to higher-level concepts created by indexing the document images and creating
a dictionary of domain-specific terms and the knowledge they represent (Sexton et al., 2018). As an
important category of approaches for knowledge discovery and mining, and with the increase in gener-
ating textual information in the forms of the scene and video images, word spotting from the scene and
video images has grown in recent years. For the past decade, several researchers from both commu-
nities of computer vision and document analysis have developed powerful methods for scene text de-
tection and recognition. Considering the huge number of methods for text spotting, one can expect to
see different approaches, evaluation schemes, and experiments on different datasets for solving the
same problem. Moreover, due to the use of multiple datasets, evaluation schemes, measures and ap-
proaches for text spotting in the images, it is difficult to analyze the scope, limitation and significance of
the methods. This leads to confusion for the reader and viewer to choose the relevant methods, define
a new challenge and find suitable applications. In addition, scene and video text spotting as textual
mining from the images has been ignored by the community compared with the methods that only
concern either scene text detection or text recognition, separately.
Thus, there is a need for a survey on text mining from natural scene and video images to understand
the growth, the objective, scope, limitation, publicly available datasets for experimentation and compar-
ative study. The survey paper can also provide readers with a clear idea of what has been done in the
past and further show them clear directions and new applications for future researchers. It is worth
noting that there are good survey papers, for example, Dadiya et al., 2019; Pooja et al., 2016; Sharma
et al., 2012; Ye et al., 2015 and Yin et al., 2016, which include old models. Several methods have been
proposed in 2019, 2020 and 2021 for addressing different issues of text spotting but there is no survey
paper to provide a summary of the recent research papers (Chekhrouhou et al., 2021; Mokayed et al.,
2021; Li et al., 2021; Khalil et al., 2021). Therefore, this survey mainly focuses on the recently published
methods/models to provide a quick summary of the spotting techniques in this particular domain. More-
over, this survey considers the multilingual models as normal text spotting approaches for reviewing.
As deep learning end-to-end models for text spotting in natural scene and video images are proposed
to avoid preprocessing steps, such as noise removal, deblurring, text alignment issues and the effect
of perspective distortion, reviewing preprocessing methods for text spotting is considered out of the
scope of this survey. In summary, this survey on the mining of text from natural scene and video images
can provide state-of-the-art information to the researchers interested in the context of text mining. It can
further help new researchers to find new challenges, and applications and to investigate new ideas by
referring to existing ideas.
3 BRIEF METHODOLOGY
To identify text detection studies in relation to natural scene and video images in the literature, the
advanced Google search engine rather than Google Scholar, Scopus or Web of Science has been
considered in this study. The choice of Google search helped us to retrieve papers that might not be
indexed in Google Scholar and other repositories. A list of keywords, including “text spotting in natural
scene image”, “text extraction in natural scene image”, “text spotting in video”, “text extraction in video”,
“text spotting in document image”, “text extraction in document image”, “word spotting in document
image”, “text spotting”, “text detection”, “word spotting” and “text extractionwere used to broadly search
and retrieve relevant and recent articles published on different outlets. Moreover, recent review and
survey papers on text and word spotting have been studied to extract relevant text spotting references.
This process resulted in 175 papers in the domain. We then reviewed the title, keywords and abstract
of each paper and excluded papers from the list if they did not contain any form of the words, including
video, scene, image, word, spotting, detection and extraction in their titles, abstracts and the list of their
keywords. As a result, we considered and reviewed a critical mass (95 papers) from the literature of
text spotting in images and videos. It is worth noting that a few (7) other papers have also been included
to enrich the literature review.
To provide a better understanding of the survey, we initially categorized the methods into non-spotting
and spotting approaches. We further classified the methods in each category into scene- and video-
based methods. Concerning the types of methods presented in each category in the literature, feature-,
learning- and hybrid-based methods have been considered as three distinct types of methods.
4 NON-SPOTTING BASED MINING APPROACHES
It is noted that the main focus of the data mining approaches is to extract meaningful information from
large and diverse databases. In the same way, when we consider images and video data collected from
social media, the size and variation of the data can be huge. In this context, extracting meaningful
information from such huge data is not easy for content-based image retrieval methods. This is due to
the gap between the content of the images and the extracted low-level features, which cannot appropriately
represent the images. To overcome this limitation of content-based image retrieval methods, text based
methods are proposed to provide the exact meaning and relevant information of the content of the images,
where they contain textual information.
These methods can broadly be classified into scene and video text based methods. The scene text
based methods focus on extracting text from natural scene images without temporal information, while
video text based methods focus on text extraction from videos by exploring temporal information. Meth-
ods in the literature of each group can further be categorized based on different perspectives, including
i) textual content (richness and sparseness of the image content), ii) type of documents (printed and
handwritten), and iii) methodological approaches (feature, learning and the combination of both
approaches). In this research work, methods in each category are categorized into three sub-categories,
namely, feature, learning and combination based methods. The schematic diagram of the methods for
extracting text from the images and video can be seen in Fig. 1, where different categories of the meth-
ods for mining information from images and video are represented. The feature-based methods give
more importance to feature extraction for finding a solution to text detection challenges and use con-
ventional classifiers for text extraction from input images or video. These methods may not be accurate
and robust to complex images. To alleviate this problem, learning-based methods are proposed. These
approaches give importance to learning with the ground truth and predefined labels to address the
challenges of text extraction. However, these methods work well only for known datasets and their
performance depends on pre-defined samples. Hence, the methods fall short in terms of generality. This
motivated researchers to introduce methods that combine both feature and learning based methods,
resulting in feature + Convolutional Neural Network (CNN) based methods. These methods can provide
robustness, generality and high accuracy even in complex situations, as they integrate the advantages
of features and deep learning models.
4.1 Methods for Non-Spotting in Natural Scene Images
As mentioned, text mining methods based on non-spotting approaches in natural scene images can be
divided into three sub-categories of feature-based, learning-based and combination of feature and
learning based approaches, as shown in Fig. 2. The feature based methods target extracting unique
properties of text to differentiate text pixels from non-text pixels in an image. The features are extracted
based on the regular pattern of text information, such as the shape of characters, the color of text pixels,
the spacing between characters, the size of the characters, and the orientation of the characters. The
first step of the feature-based method is to remove non-text information from the images. In general,
the methods exploit the above-mentioned properties to retain text pixels and remove non-text pixels
resulting in a set of text candidates. The text candidates are then used to restore full text information based
on the nearest neighbor criterion, the spatial relationship between the text candidates and the orienta-
tion of text candidates. Bounding boxes are finally fixed for the words or text lines of any orientation by
exploring the concept of polygonal approximation and curve fitting.
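As a concrete illustration of this conventional pipeline, the following sketch (a minimal example of our own, not the implementation of any surveyed method; the input file name and the thresholds are assumptions) detects text candidates with OpenCV's MSER detector, groups nearby candidates with a simple nearest-neighbour criterion, and fits one bounding box per group.

```python
import cv2
import numpy as np

def detect_text_candidates(gray):
    """Detect text candidate regions with MSER (a common choice in
    feature-based methods); returns a list of (x, y, w, h) boxes."""
    mser = cv2.MSER_create()
    _, boxes = mser.detectRegions(gray)
    keep = []
    for (x, y, w, h) in boxes:
        # Simple shape heuristics mimicking the character-property rules above.
        aspect = w / float(h)
        if 4 <= h <= 200 and 0.1 <= aspect <= 10:
            keep.append((int(x), int(y), int(w), int(h)))
    return keep

def group_candidates(boxes, gap_ratio=1.5):
    """Greedily merge candidates whose horizontal gap is small relative to
    their height (a crude nearest-neighbour grouping criterion)."""
    boxes = sorted(boxes, key=lambda b: b[0])
    groups = []
    for b in boxes:
        x, y, w, h = b
        placed = False
        for g in groups:
            gx, gy, gw, gh = g[-1]
            same_line = abs((gy + gh / 2) - (y + h / 2)) < 0.6 * max(gh, h)
            close = x - (gx + gw) < gap_ratio * max(gh, h)
            if same_line and close:
                g.append(b)
                placed = True
                break
        if not placed:
            groups.append([b])
    # Fit one axis-aligned bounding box per group (word/line hypothesis).
    merged = []
    for g in groups:
        xs = [b[0] for b in g]; ys = [b[1] for b in g]
        xe = [b[0] + b[2] for b in g]; ye = [b[1] + b[3] for b in g]
        merged.append((min(xs), min(ys), max(xe) - min(xs), max(ye) - min(ys)))
    return merged

if __name__ == "__main__":
    img = cv2.imread("scene.jpg")            # hypothetical input image
    if img is not None:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in group_candidates(detect_text_candidates(gray)):
            cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.imwrite("detections.jpg", img)
```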
The deep learning models for non-spotting in natural scene and video images can be further classified into
regression/anchor based models (Arafat et al., 2020; Chandio et al., 2020; Huang et al., 2019) and
segmentation based models (Dai et al., 2020; Guo et al., 2020; Cai et al., 2020). The former class considers
the whole text as an object for detection, while the latter class merges pixel by pixel or character by
character for text detection in natural scene and video images. When the models consider the whole
text as an object, there are chances of misclassifying non-text as text due to text-like objects in the
background, and therefore the performance of the models degrades. To overcome this problem,
segmentation based models work at the pixel level; these methods are, however, sensitive to complex
backgrounds compared to regression-based models.
Fig. 1. Mining tree of non-spotting based methods, with Scene Images and Video branches each divided
into F, L and F + L, where F, L and F + L denote Feature, Learning and the combination of Feature and
Learning based methods, respectively.
Learning based methods use ground truth for training models. Most of the models consider pixels or/and
different forms of input images as input for designing the architecture of Neural Network (NN) based
models. As the number of layers in the NNs increases, the ability of the architecture also increases.
Therefore, the models can work on complex situations with high accuracy compared to feature based
methods. At present, the models use different architectures, such as ResNet, U-Net, and GAT, to com-
bine information and be more generic. Like feature-based methods, the outputs of the learning-based
models are text regions. The text regions are then segmented using a simple thresholding criterion. For
fixing bounding boxes, the models use curve fitting concepts and orientation of the text. In the case of
text detection, non-text does not have a boundary and hence getting relevant samples that represent
all possible cases of non-text regions is difficult. As these models are designed based on pre-defined
samples, there are, however, high chances of losing accuracy for totally unknown/unseen input images.
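To make this post-processing step concrete, the sketch below (our own minimal illustration, assuming a segmentation-style network has already produced a per-pixel text score map) applies the simple thresholding criterion and fits rotated bounding boxes with OpenCV.

```python
import cv2
import numpy as np

def boxes_from_score_map(score_map, thresh=0.5, min_area=30):
    """Turn a per-pixel text score map (H x W, values in [0, 1]) into
    rotated word/line bounding boxes via thresholding and contour fitting."""
    # 1. Simple thresholding of the network output.
    binary = (score_map > thresh).astype(np.uint8) * 255
    # 2. Each connected blob is treated as one text region.
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for cnt in contours:
        if cv2.contourArea(cnt) < min_area:
            continue  # discard tiny blobs, likely noise/false positives
        rect = cv2.minAreaRect(cnt)          # (center, size, angle)
        boxes.append(cv2.boxPoints(rect))    # 4 corner points, any orientation
    return boxes

# Usage with a dummy score map (in practice this comes from the model):
score = np.zeros((100, 200), dtype=np.float32)
score[40:60, 30:170] = 0.9
print(len(boxes_from_score_map(score)))      # -> 1
```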
To ease the above limitation, some methods combine feature extraction and deep learning architec-
tures. The combined models integrate the merit of feature-based and deep learning models. Therefore,
feature extraction is generally considered as one layer for text detection in the images. The main ad-
vantage of these models is that they work well with a few samples and they are more generic and
suitable to be used for different datasets and applications. The logic and steps involved in all three
categories are shown in Fig. 2. At the same time, the sample results of text detection for mining from
different cases and situations are shown in Fig. 3, where for each query text, the corresponding method
finds text in the images.
4.1.1 Feature Extraction for Text Detection
The recent methods that use handcrafted features and conventional classifiers for text detection in natural
scene images are listed in Table 1, where the scope and objective, strengths and weaknesses of the
methods are also presented. From Table 1, it is noted that the methods in the literature have addressed
almost all the challenges of text detection in natural scene images. At the same time, the strengths of
the methods indicate that the methods use different characteristics of character and text for separating
text and non-text pixels before detecting text in the images. Moreover, we observed that most of the
methods are sensitive to poor quality images. When the images are of poor quality, there are chances
of loss of character shapes and hence text and non-text pixels in the images are classified poorly.
Fig. 2. The pipeline of text mining from scene images based on conventional Features (F), Learning (L)
and the combination of both (F + L), covering text candidates/segmentation, bounding box fixing and mining.
Table 1. Analysis of feature extraction based methods for text detection in natural scene images.
Method | Objective | Strength | Weakness
Francis et al. (2020) | Text detection | Simple least-squares SVM | Use of Otsu thresholding
Roy et al. (2020) | Text detection from multi-views | Delaunay triangulation | Limited to two views
Raghunandan et al. (2019) | Multi-script-oriented text detection | Mutual nearest neighbour concept | Sensitive to poor quality images
Guo et al. (2020) | Traffic and text detection | Exploring color features | Sensitive to lighting conditions
Liu et al. (2020) | Text detection in natural scene images | Exploring morphological component analysis | Performance depends on the size of the sliding window
Panhwar et al. (2019) | Signboard detection | Artificial neural network | Sensitive to arbitrary orientation
Khan et al. (2019) | Text detection in both natural scene and document images | Exploring maximally stable extremal regions | Sensitive to low contrast and poor quality images
4.1.2 Learning based Approaches for Text Detection
For the past few years, many methods have been developed using the machine learning concept for
text detection in natural scene images. As can be noted from Table 2, the models are more accurate
for text detection in natural scene images compared to the conventional methods. The models used
different architectures and combined several architectures to address the challenges of text detection.
The performances of the methods in this category highly depend on the number of samples, especially
non-text samples, and relevant samples. The last column of Table 2 presents the different weaknesses
of the methods, such as being computationally expensive and producing a high number of false positives.
The high number of false positives indicates that collecting and annotating relevant non-text regions is
not easy or is sometimes time-consuming.
Fig. 3. Examples of text mining from natural scene images using non-spotting based methods (sample
query text and spotted text from the SVT, MSRA and ICDAR 2015 datasets).
Table 2. Analysis of learning based methods for text detection in natural scene images.
Method | Objective | Strength | Weakness
Tursun et al. (2020) | Scene text detection and erasing | Mask based text inpainting network | The main focus is on inpainting
Wang et al. (2020) | Scene text detection | Quadrilateral region proposal network | Not robust to curved text detection
Bonechi et al. (2020) | Text detection with a small dataset | Weakly supervised learning approach | The architecture is tailored to a particular task
Zhu et al. (2020) | Scene text detection | Text center and border probability based network | Sensitive to small-sized text
Cai et al. (2020) | Robust scene text detection | Hierarchical supervision module with inside-to-outside supervision network | Computationally expensive and sensitive to arbitrarily shaped text
Zheng et al. (2020) | Robust scene text detection | Multi-scale context features based network | Sensitive to short text
Liu et al. (2020) | Scene text detection with fewer samples | Inductive and transductive semi-supervised network | Poor performance for dense text in the images
Ma et al. (2020) | Arbitrarily shaped scene text detection | Text primitives based graph convolutional network | Vulnerable to false positives
Huang et al. (2019) | Scene text detection | Fine-grained attention mask based network | Vulnerable to false positives
Dai et al. (2020) | Curved scene text detection | Multi-scale context aware feature aggregation based network | Sensitive to low contrast text
Qin et al. (2019) | Curved scene text detection | Semi- and weakly supervised learning based approach | Does not work for dense text images
Liu et al. (2020) | Arbitrarily shaped scene text detection | Mask tightness based network | Vulnerable to false positives
Chandio et al. (2019) | Multi-lingual scene text detection | Fast R-CNN network based approach | Not effective for font size variations
Xu et al. (2019) | Irregular scene text detection | Learning a deep direction field | Sensitive to large character spacing
Arafat et al. (2020) | Urdu scene text detection | Faster R-CNN based approach | Limited to a particular language text
Xiao et al. (2019) | Multi-oriented and multi-language scene text detection | Text context aware scene based network | Computationally expensive
4.1.3 Combination of Features and Learning based Approaches for Text Detection
To integrate the strength of feature extraction and deep learning models in a single method, a few
methods have been developed by combining both features and deep learning architectures, as listed in
Table 3. The feature based methods obtain dominant information that represents text and can cope
with the challenges of text detection, and then the deep learning models use the dominant information along
with the input image information to achieve better results. The methods are capable of handling
complex situations and do not require a large number of samples to obtain accurate results. However,
the methods are generally complex and computationally expensive compared to individual feature
based and learning based methods.
Table 3. Analysis of the combination of feature extraction and learning based methods for text detection
in natural scene images.
Method | Objective | Strength | Weakness
Saha et al. (2020) | Multi-lingual scene text detection | Maximally stable extremal regions and stroke width transform with a generative adversarial network | Computationally expensive and limited to a particular language
Xue et al. (2020) | Arbitrarily-oriented low light scene text detection | Maximally stable extremal regions and the cloud of line distribution with a convolutional neural network | Not robust to images of varying quality
4.1.4 Limitations of the Methods
Despite powerful methods in the literature for text detection in the images, these methods bear several
limitations. In the text detection methods based on words, as long as clear structures or shapes of all
characters are available, the methods perform well for text detection. For example, if a word contains a
few characters, these methods can define the relationship between characters based on context features
and spatial relationships. When the number of detected characters in a word is small, these methods lose
discriminative power. From the literature, it is evident that most of the methods have used deep learning
for achieving better results. Indeed, deep learning based methods work well when they are trained with
a large number of samples. At the same time, the feature based methods use handcrafted features to
avoid dependency on a large number of samples. However, the feature based methods may not achieve
high text detection results compared to deep learning based methods. To overcome this problem, a
combination of feature based and machine learning approaches has been proposed in the literature.
But the question is how to decide which part of the problem should be handled by the feature extraction
part and which by the deep learning part. In addition, one can still expect some dependency between the
features and the deep learning models, which introduces redundancy in feature extraction. In this
situation, the trade-off between feature and deep learning models, and how to balance both, is an
important factor for consideration.
In complex situations, it is necessary to design a substantial text detection model with a complex struc-
ture, which may be computationally expensive. However, the question is how to make it computationally
efficient without compromising the results and accuracy. Furthermore, how can we design and
develop such models for real-time applications? Answering these questions leads to a trade-off between
the results and design and the results and efficiency. Therefore, there is a scope for improvement and
inventing new ideas to make text detection methods robust and generic without losing results and effi-
ciency.
4.1.5 Summary
This section focuses on discussing the non-spotting text detection methods in natural scene images
based on handcrafted features, deep learning and the combination of both feature and deep learning.
The analysis of each category for text detection in different contexts and situations is discussed in this
section. Their advantages and disadvantages according to applications and different situations are also
explained. Scene text images do not provide temporal information for improving the detection results.
Therefore, their applications are limited only to scene text images. For example, it is not possible to
trace the text or objects in a series of images and it does not help to identify the action in the images.
When this temporal information is missing, there is no simple solution to restore the missing information.
This is the motivation to propose the methods for text detection in videos in the literature, which will be
discussed in the subsequent section.
4.2 Methods for Non-Spotting in Video Images
As mentioned in the previous section, the applications for text detection in videos are different from text
detection in natural scene images. For example, action recognition, event identification, tracing, surveil-
lance and monitoring are some of the applications of text detection in videos, where the methods should
use temporal information. The main advantage of these methods is the use of temporal information to
estimate motion, predict and restore missing information. The methods in this domain can be catego-
rized into feature based, learning based and the combination of both feature and learning based meth-
ods, which are similar to the methods of text detection in natural scene images as demonstrated in Fig.
4. The feature-based methods usually use temporal information at each stage to enhance the fine de-
tails in the images. For example, in text candidate detection, temporal frames are used to improve the
quality of the image. Due to the low resolution and low contrast of video frames compared to natural
scene images, the missing information or shapes sometimes need to be restored. To alleviate the prob-
lem of low contrast and low resolution, most of the time, the feature based methods use temporal frames
for improving the quality of the images. In this way, the video information helps feature-based methods
for mining text in the video.
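One simple way such temporal information can be exploited for quality enhancement is per-pixel temporal median filtering over a short window of frames, which suppresses random noise around (near-)static text. The sketch below is our own minimal illustration under that assumption, not a specific surveyed method; the video file name is hypothetical.

```python
import numpy as np
import cv2

def enhance_with_temporal_median(frames):
    """Fuse a short window of consecutive frames (list of H x W x 3 arrays)
    by per-pixel temporal median to suppress noise and strengthen static text."""
    stack = np.stack(frames, axis=0).astype(np.float32)
    fused = np.median(stack, axis=0)
    return np.clip(fused, 0, 255).astype(np.uint8)

# Usage: read a window of frames around the frame of interest.
cap = cv2.VideoCapture("clip.mp4")               # hypothetical video file
window = []
for _ in range(5):
    ok, frame = cap.read()
    if not ok:
        break
    window.append(frame)
cap.release()
if window:
    enhanced = enhance_with_temporal_median(window)
    cv2.imwrite("enhanced_frame.jpg", enhanced)
```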
However, feature-based methods may not be accurate for complex situations as the success of the
method depends on the success of pre-processing steps. To alleviate this issue, learning-based meth-
ods are proposed and used for achieving better results. In all the steps of learning based methods,
temporal information is used to improve the performance of the methods. However, the learning based
methods may not be good for generalization as their performance depends on the size of the training
samples. Therefore, the hybrid methods use both the handcrafted feature and the deep learning model
to overcome the limitations of feature and learning based methods for text mining. In this case, the
output of feature extraction can be considered as input for the deep learning models. The sample results
of text spotting in the video frames at different situations are shown in Fig. 5. From the results, one can
conclude that the feature extraction and deep learning models are complementing each other to achieve
the best results in complex scenarios.
4.2.1 Feature Extraction for Text Detection in Videos
A list of the text detection methods in videos, together with their objectives, strengths and limitations, is
provided in Table 4. From Table 4, it is evident that the methods have addressed almost all the challenges of text
detection in video frames. Most of the methods find the fine details of the frames, such as edges, as
the edge is a prominent feature to represent text in the images or video frames. Due to the low contrast
and low resolution of videos, these methods may not be robust to small fonts and poor quality images.
Although the methods use temporal information for improving the quality of the frames, these methods
still lose edges and hence fail to extract the shape or structure of characters. Moreover, in the case of
videos, a frame can have two different types of texts, namely, caption and scene text. Both types have
different characteristics and nature. Therefore, it is not easy to use a feature, which can work for both
types of texts.
Fig. 4. The pipeline of text mining from video based on Features (F), Learning (L) and the combination of
both (F + L), where temporal information supports text candidate detection, segmentation, bounding box
fixing and mining.
Table 4. Analysis of the feature extraction based method for text detection in video images.
Method | Objective | Strength | Weakness
Putro et al. (2019) | Real-time text detection | Edge detection and clustering based approach | The features are not robust to frame quality
Raghunandan et al. (2019) | Multi-script-oriented text detection | Convex hull and deficiency and clustering based approach | Sensitive to very small fonts and poor quality images
Youngiu et al. (2019) | Video text detection | Edge features based approach | Sensitive to parameters and thresholds
4.2.2 Learning based Approaches for Text Detection in Videos
To obtain more accurate text detection results, learning based methods using temporal information
have been proposed in the literature. Table 5 demonstrates some methods that used the deep learning
approach for text detection in videos. Despite the use of different architectures in the literature, the
methods are sensitive to false positives. This is true as defining non-text and finding relevant samples
is harder than finding text regions. Therefore, though they use temporal information, there is a high
chance of producing more false positives in the machine learning based methods.
Table 5. Analysis of learning based methods for text detection in video images.
Method | Objective | Strength | Weakness
Nag et al. (2019) | Marathon bib and jersey number detection | Deep CNN is explored | Limited to marathon and sports video
Song et al. (2019) | Video text frame detection | Use of a Text Siamese Network | More false positives for complex background images
Wang et al. (2019) | Video text detection | Hierarchically exploits low-level features through CNN | The scope is limited to frame detection but not text detection
Yan et al. (2020) | Subtitle detection in video | Connectionist text proposal network | Not robust enough to achieve good results
Yu et al. (2019) | Video text detection | Use of Convolutional LSTM | Computationally expensive
Zhou et al. (2019) | Video text detection | YOLO architecture is explored | Application oriented method
Fig. 5. Examples of text mining based on non-spotting based methods from video frames of different
datasets (query text and spotted text from the NUS, YVT, ICDAR 2013 and ICDAR 2015 datasets).
4.2.3 Combination of Features and Learning based Approaches for Text Detection in Videos
To make the methods robust for text detection in videos, a combination of handcrafted features and
deep learning architecture has been proposed in the literature. Table 6 presents these methods that
use feature extraction and deep learning models differently to obtain the best text detection results.
These methods, however, fail to address the challenges of the small font and non-uniform illumination
effect. When the methods use the combination of features and learning based approaches, the deep
learning models are used as a classifier but not as a feature extractor. Thus, these methods are com-
putationally more efficient compared to fully deep learning based methods.
Table 6. Analysis of the combination of feature and learning based methods for text detection in
video images.
Method | Objective | Strength | Weakness
Fassold et al. (2019) | Real-time text detection | Features for preprocessing and YOLO for detection | Sensitive to the number of temporal frames
Nag et al. (2020) | Text of marathon and sports video | Combination of gradient magnitude and direction along with CNN | Not robust to occlusion, blur and very small fonts
Rasheed et al. (2019) | Turkish text detection | Deep convolutional neural network | Sensitive to scaling
Guo et al. (2020) | Traffic and text detection | Exploring color features | Sensitive to lighting conditions
4.2.4 Limitations of the Methods
There are two major limitations of non-spotting text detection methods in video frames that lead to poor
detection results. The first problem is the poor handling of the different nature of the two types of text:
caption and scene. Since the nature of scene text is unpredictable and the nature of the caption text is
predictable, it is hard to extract features that work well for both texts. One way to resolve this issue is
to apply a text classification method to classify the caption and scene texts in the video in order to
improve the final text detection results. The second problem is determining the number of temporal
frames for operations. Most of the methods assume the number of temporal frames for the operations.
When the complexity of the problem changes, this constraint may not work well. Thus, it is necessary to
find appropriate ways to determine the number of frames automatically according to the situation.
4.2.5 Summary
Like text detection from images, video text detection methods fall into three broad categories, namely,
feature based, learning based, and the combination of features and deep learning models.
Since videos usually suffer from low resolution and contrast, text detection methods generally use tem-
poral information to enhance the quality of the frames and restore missing information in order to obtain
higher text detection accuracies. Moreover, deep learning models use temporal frames (either all or a
number of them) as additional training information to generate more generic models and achieve more
accurate text detection results. Due to processing quite a large number of temporal frames, these meth-
ods need more computational power. This need for a high-performance computing machine causes a
serious issue with video text detection based methods when the architecture of the systems becomes
complex. Moreover, there is a need for finding a criterion to determine the optimal number of temporal
frames to be used automatically according to the problem complexity.
5 SPOTTING BASED MINING APPROACHES
The methods discussed in the previous section try to separate text and non-text and extract the entire text
content from natural scene images and videos. Although text helps us to derive meaningful information
from the scene and video, it lacks the global meaning of the images and video and the extracted text
may not be representative of images or videos. These methods are also computationally expensive.
This has motivated researchers to develop methods for spotting text in natural scene images and
videos. The spotted text provides a global meaning representing the whole image and video. These
methods are more efficient and accurate compared to the text detection methods especially for retriev-
ing information from a large pool of data. This section discusses the word spotting methods for mining
text in natural scene images and video frames. To keep the consistency of the presentation, methods
in this group are categorized into feature extraction and learning based methods.
5.1 Word Spotting Methods in Natural Scene Images
Text spotting in natural scene images commonly involves two stages: i) text detection and ii) text recog-
nition. There are two categories of approaches, including conventional and end-to-end, for text spotting
in the literature (Hui et al., 2017; Song et al., 2019). The conventional approach comprises a general
pipeline with a text detector module to initially localize the text in a scene image followed by a text
recognizer module to recognize the detected text. The end-to-end text spotting methods can simulta-
neously detect text positions and recognize them. This is in line with the human reading skill, which
performs text detection and recognition in a single shot (Song et al., 2019).
Considering the basic components of a text, text detection methods in the literature (Liu et al., 2018;
Song et al., 2019) can be classified into four categories: character-based, word-based, text-line based,
and fine-scale text proposal based approaches. In character-based methods, individual characters are
initially detected and then they are concatenated to obtain words and text lines using several post-
processing steps, including character filtering and reorganization. Character-based text detection meth-
ods can further be categorized into Connected Components (CC) based and sliding-window based
methods. In CC-based methods, as the most conventional approach of text detection in images, char-
acters are detected by grouping the pixels of similar characteristics, such as color, and intensity to
identify CCs, and then analyzing the properties of the extracted CCs to detect characters among the
set of CCs. The detected characters are then grouped to construct words or text lines. In sliding-window
(region) based methods, different window slides and local features are used to localize characters from
input images (Zamberletti et al., 2015). In word-based text detection methods, words are considered as
different objects, and therefore, these methods are categorized as general object detection methods.
These methods detect word bounding boxes from a large number of word proposals by applying a
filtering strategy based on confidence scores obtained from a trained classifier. To obtain accurate text
bounding boxes, the filtered text proposals will finally be regressed. In text-line based methods, text
lines are firstly detected and then each text-line is further segmented to obtain word bounding boxes.
In fine-scale text proposal methods, word or text-line proposals are initially detected and then the de-
tected text proposals are merged to form complete words or text lines (Zamberletti et al., 2015; Liu et
al., 2018; Song et al., 2019).
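The proposal-filtering step used by word-based detectors is typically realized as confidence thresholding followed by non-maximum suppression over the word proposals. The following self-contained sketch (our own illustration, with assumed thresholds, not the procedure of any particular surveyed method) shows this standard filtering.

```python
import numpy as np

def nms(boxes, scores, score_thresh=0.5, iou_thresh=0.4):
    """Filter word proposals: keep high-confidence boxes and suppress
    overlapping, lower-scored ones. boxes: (N, 4) as x1, y1, x2, y2."""
    keep_mask = scores >= score_thresh
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = scores.argsort()[::-1]          # highest confidence first
    kept = []
    while order.size > 0:
        i = order[0]
        kept.append(i)
        # Intersection-over-union of the best box with the remaining ones.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter + 1e-6)
        order = order[1:][iou < iou_thresh]  # drop overlapping proposals
    return boxes[kept], scores[kept]

# Example: two overlapping proposals for the same word and one low-confidence one.
props = np.array([[10, 10, 60, 30], [12, 11, 62, 31], [200, 50, 230, 70]], float)
conf = np.array([0.9, 0.8, 0.3])
print(nms(props, conf)[0])                   # only the first box survives
```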
The purpose of the text recognition stage is to generate human-readable character sequences (text)
from the variable-length cropped/detected text images. Text recognition methods in the literature can
be categorized into four different groups: character-based, word-based, sequence-to-label decode
based, and sequence-to-sequence based methods (Liu et al., 2018; Song et al., 2019). Character-
based text recognition methods generally consist of three steps, including character detection, and
character recognition followed by character grouping and refining misclassified characters. This ap-
proach largely depends on the results of the character detection step and therefore, accumulated errors
are the major concern in this approach. In word-based text recognition methods, each word is consid-
ered as a whole and holistic word classification is commonly performed to achieve word recognition
(Bagi et al., 2020). A dictionary of segmented words may further need to be considered in this approach.
Sequence-based methods, as an advanced and modern way of text recognition, are widely used in the
literature (Liu et al., 2018; Song et al., 2019). In the sequence-to-label category, a feature sequence is
first extracted from the input image, and then a label sequence is predicted by neural networks (gener-
ally Recurrent Neural Networks (RNN)) providing recognized characters (Liu et al., 2018; Song et al.,
2019). Sequence-to-sequence methods try to automatically obtain certain extracted CNN features and
implicitly learn a character-level language model embodied in an RNN (Liu et al., 2018).
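A typical sequence-to-label recognizer of the kind described above follows the CRNN pattern: a CNN converts the cropped word image into a feature sequence along the width axis, a recurrent layer models context, and a per-timestep classifier predicts character labels that are decoded with CTC. The sketch below is a minimal PyTorch illustration under these assumptions, not the implementation of any particular surveyed method; layer sizes and the alphabet size are our own choices.

```python
import torch
import torch.nn as nn

class TinySequenceRecognizer(nn.Module):
    """Minimal CRNN-style model: CNN -> feature sequence -> BiLSTM -> labels."""
    def __init__(self, num_classes=37):          # 26 letters + 10 digits + CTC blank
        super().__init__()
        self.cnn = nn.Sequential(                # collapses height, keeps width
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1), (2, 1)),
            nn.AdaptiveAvgPool2d((1, None)),     # height -> 1, width = sequence length
        )
        self.rnn = nn.LSTM(128, 96, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 96, num_classes)

    def forward(self, x):                        # x: (B, 1, 32, W) grayscale word crop
        f = self.cnn(x)                          # (B, 128, 1, W')
        seq = f.squeeze(2).permute(0, 2, 1)      # (B, W', 128) feature sequence
        out, _ = self.rnn(seq)                   # contextual features per timestep
        return self.fc(out)                      # (B, W', num_classes) label scores

model = TinySequenceRecognizer()
logits = model(torch.randn(2, 1, 32, 128))       # two dummy word crops
print(logits.shape)                              # torch.Size([2, 64, 37])
# Training would apply nn.CTCLoss to the log-softmax of these logits against
# the ground-truth character sequences (sequence-to-label decoding).
```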
Recently, deep learning-based approaches have become dominant in both text detection and recogni-
tion stages. For text detection, CNN-based deep learning is usually used to extract feature maps from
a scene image, and then different decoders are used to decode the regions (Tian et al., 2016). For text
recognition, a network for sequential prediction is applied to the extracted text regions (Shi et al., 2017).
When the detection and recognition stages work separately, this is time and cost consuming,
especially for images with several text regions. Moreover, the correlation in visual cues shared
in detection and recognition is not considered and the detection network cannot be supervised by labels
from text recognition, and vice versa (Liu et al., 2018).
Most text spotting methods in the literature (Liu et al., 2019), first, generate several text proposals using
a text detection model and then recognize them with a separate text recognition model (Jaderberg et
al., 2016; Gupta et al., 2016). The end-to-end text spotting methods commonly use a text proposal
generation model for text detection and a text recognition method for text spotting. Moreover, text spotting
has evolved from simple horizontal text to complicated and challenging situations, such as curved and
multi-directional text (Liu et al., 2018). It is worth noting that the earlier methods in the literature use
handcrafted features for scene text spotting. Furthermore, lexicon-free end-to-end text recognition sys-
tems have recently been proposed for scene text spotting (Liao et al., 2019). In the subsequent sub-
section, a detailed discussion on feature based methods for word spotting is provided.
5.1.1 Feature Extraction for Word Spotting
Two types of features, handcrafted-based and deep-learning-based, have been used for text spotting
in images in the literature (Zamberletti et al., 2015; Gomez et al., 2017; Jaderberg et al., 2014; Jader-
berg et al., 2016). The handcrafted features used in the literature include color channels (R, G, B),
foreground intensity, background intensity, foreground Lab color, background Lab color, spatial pyramid
levels, diameter, gradient, and stroke width (Gomez et al., 2017). Using the extracted features and a
holistic CNN classifier, a set of word proposals has been generated without an explicit character
segmentation to obtain word spotting in an end-to-end manner (Gomez et al., 2017). Moreover, a con-
ventional sliding window text detection based on Aggregate Channel Features (ACF) coupled with an
AdaBoost classifier has been used in the literature (Jaderberg et al., 2016). ACF features include nor-
malized gradient magnitude, the histogram of oriented gradients, and the raw grayscale pixel values.
Each channel C has been smoothed, divided into blocks and the pixels in each block were summed
and smoothed again to obtain aggregate channel features. It is noted that the ACF features are not
scale-invariant, so for multi-scale text detection, features at different scales (pyramid) need to be ex-
tracted (Jaderberg et al., 2016). The pyramidal histograms of characters as features have also been
used to represent word images and their textual transcriptions to enable both query-by-example and
query-by-string searches in a unified framework for word searches in handwritten and natural images.
The features are discriminative and the similarity between words is independent of the writing and font
style, illumination, and capturing angle (Almazan et al., 2014). Shape code based word matching for
spotting words in Indian multilingual documents is proposed by Tarafdar et al. (2010), where geo-
metrical features, such as extreme points, crossing counts, zonal features, loop-based features are
extracted from the input images. Similarly, the combination of rotation invariant features and SVM clas-
sifier has been used for spotting words in graphical documents and, to improve the results of spotting,
SIFT features are also used by Tarafdar et al. (2013).
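To make the aggregate channel feature computation concrete, the sketch below is our own simplified illustration (the detector in Jaderberg et al., 2016 uses a richer channel set and a boosted classifier): per-pixel channels are smoothed, summed over small blocks and smoothed again. The input file name, block size and number of orientation bins are assumptions.

```python
import numpy as np
import cv2

def aggregate_channel_features(gray, block=4, n_orient=6):
    """Simplified ACF: per-pixel channels (grayscale, gradient magnitude,
    oriented-gradient histograms) are smoothed, block-summed and smoothed again."""
    gray = gray.astype(np.float32) / 255.0
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    ang = np.arctan2(gy, gx) % np.pi             # unsigned orientation in [0, pi)
    channels = [gray, mag]
    # HOG-like channels: gradient magnitude binned by orientation.
    for k in range(n_orient):
        lo, hi = k * np.pi / n_orient, (k + 1) * np.pi / n_orient
        channels.append(mag * ((ang >= lo) & (ang < hi)))
    feats = []
    for ch in channels:
        ch = cv2.GaussianBlur(ch, (3, 3), 0)     # pre-smooth each channel
        h, w = ch.shape
        h, w = h - h % block, w - w % block
        # Sum pixels inside each block x block cell (the "aggregation" step).
        cells = ch[:h, :w].reshape(h // block, block, w // block, block).sum(axis=(1, 3))
        feats.append(cv2.GaussianBlur(cells, (3, 3), 0))
    return np.stack(feats, axis=0)               # (n_channels, H/block, W/block)

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input image
if gray is not None:
    print(aggregate_channel_features(gray).shape)
```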
Augmented multi-resolution maximally stable extremal regions and convolutional neural networks have
further been employed for text spotting from scene images (Zamberletti et al., 2015). Using simple and
fast geometric transformations on multi-resolution proposals and character augmentation without con-
sidering deep architectures and a large amount of training data provided high text detection rates in
scene images (Zamberletti et al., 2015). Moreover, Pyramid Histogram of Oriented Gradient (PHOG)
features and Zernike moments have been employed in different stages of the proposed two-stage Hid-
den Markov Model (HMM) based framework for keyword detection in video frame/scene images of
multiple scripts. The features have been extracted using a sliding window passed on the binarized text
lines segmented from the scene image/video frames. To improve the performance of the proposed word
spotting framework, a dynamic shape coding using contextual information extracted by adding time
derivatives from the neighbouring windows has further been used in the literature (Roy et al., 2019).
Different convolutional deep learning neural network based methods have recently been used as fea-
ture backbone to extract features in order to appropriately handle the text of different scales (Gao et al.,
2019; Qin et al., 2019). Features have been extracted by using the output of one or more of the hidden
layers in CNN (Gao et al., 2019; Qin et al., 2019). Sharing features extracted from CNN has also been
used to extend a character classification method to character detection and bigram classification. A rich
feature set generated by training a strongly supervised character classifier and the intermediate hidden
layers have further been considered as features for text detection, character classification, and bigram
classification (Jaderberg et al., 2014). This method leverages the convolutional structure of a CNN to
process the entire image in a single pass and generate all the features required to detect word bounding
boxes, and then to recognize detected words from a fixed lexicon using the Viterbi algorithm (Jaderberg
et al., 2014).
Moreover, edge boxes have been used in the literature to obtain text word bounding box proposals as
several collections of characters with sharp boundaries (Jaderberg et al., 2016). A region-based feature
extraction using Region-of-Interest (RoI) pooling layer has also been used to generate feature maps
with varying lengths. An RNN encoder has then been employed to encode feature maps of different
lengths into the same size (Hui et al., 2017). A bottom-up method for keyword spotting in multi-oriented
Chinese scene text has been presented by Wu et al. (2018). The method is based on the single-shot
object detection (SSD) method and detects characters and looks for the keywords by considering the
context and relationship between distance and scale of each character pair in the image (Wu et al.,
2018).
5.1.2 Learning based Methods for Word Spotting
Learning based methods for text spotting can be divided into two different categories: conventional
machine learning, and deep learning based approaches. The conventional machine learning based
methods have longer history in the literature of word spotting compared to the deep learning meth-
ods (Gao et al., 2019), whereas deep learning based methods are more advanced and have recently attracted
many researchers (Jaderberg et al., 2014; Bazazian et al., 2018).
From the first category of methods, a two-stage word spotting approach based on HMM has been pre-
sented in the literature to detect keywords in multi-script text lines extracted from natural scene images
and video frames (Roy et al., 2019). A script identification step has been employed to identify the script of
the line. An unsupervised dynamic shape coding based approach has then been used to group similar shape
characters to improve the performance. Next, the hypothesis locations have been verified to improve
retrieval performance. The proposed system has been evaluated by searching keywords in natural
scene image and video frames of English and two popular Indic scripts (Roy et al., 2019). In another
system presented in (Almazan et al., 2014), both word images and text information have been combined
with label embedding, attribute learning, and a common subspace regression. The PHOC and
scale-invariant feature transform (SIFT) descriptors of images have been computed to characterize the
images. Word images have first been encoded into feature vectors, and these feature vectors have
been used together with the PHOC labels to learn linear SVM-based attribute models. To learn SIFT
descriptors, Gaussian Mixture Models (GMMs) have been utilized. As images and the corresponding
text strings in the images are close together, recognition and retrieval tasks can be seen as the nearest
neighbour problem. The proposed feature representation has a fixed length, is of low dimension, and is
very fast to compute (Almazan et al., 2014). This method can also be positioned within the conventional
methods.
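The PHOC representation used above can be illustrated with a short, self-contained sketch: for each pyramid level the word is split into that many regions, and a character sets the bit of a region if at least half of its normalized extent falls inside that region. This is a simplified rendering of the descriptor of Almazan et al. (2014); the alphabet and pyramid levels below are our own choices.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def phoc(word, levels=(2, 3, 4, 5)):
    """Pyramidal Histogram Of Characters: a binary attribute vector indicating
    which characters occur in which spatial region of the word."""
    word = word.lower()
    n = len(word)
    vec = []
    for L in levels:
        for r in range(L):                        # region r of level L
            region = (r / L, (r + 1) / L)
            bits = np.zeros(len(ALPHABET), dtype=np.float32)
            for i, ch in enumerate(word):
                if ch not in ALPHABET:
                    continue
                occ = (i / n, (i + 1) / n)        # normalized character extent
                overlap = min(occ[1], region[1]) - max(occ[0], region[0])
                # Character belongs to the region if at least half of it lies inside.
                if overlap >= 0.5 * (occ[1] - occ[0]):
                    bits[ALPHABET.index(ch)] = 1.0
            vec.append(bits)
    return np.concatenate(vec)                    # length = sum(levels) * |alphabet|

d = phoc("text")
print(d.shape)        # (504,) for levels 2+3+4+5 over a 36-character alphabet
# Word images embedded into the same attribute space (e.g., via SVM attribute
# classifiers) can then be matched to query strings by nearest-neighbour search.
```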
Bazazian et al. (2018) have proposed character probability maps, as an intermediate representation of
images for word spotting. The character probability maps called Soft-PHOC have been obtained based
on the extended concept of the Pyramidal Histogram Of Characters (PHOC) in combination with Fully
Convolutional Networks by computing pixel-wise mapping of the character distribution in candidate
word regions. The Soft-PHOC descriptors have been used for word spotting tasks in egocentric camera
streams using text-line proposals. The text proposals have been extracted based on the application of
Hough Transform on character probability maps and scores obtained using Dynamic Time Warping
(DTW). The benefit of this technique is that there is no need to apply complex post-processing and also
it is not necessary to generate a multi-oriented bounding box proposal with four coordinates for each
proposal. Preliminary experiments showed that detecting line proposals was simpler and more efficient
compared with bounding box proposals to detect query words in scene images (Bazazian et al., 2018).
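The DTW score mentioned above measures how well a query descriptor sequence aligns with the descriptors extracted along a candidate text-line proposal. The following generic sketch (our own illustration, assuming column-wise feature vectors; not the exact scoring used by Bazazian et al., 2018) computes a length-normalized DTW cost.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping cost between two feature sequences a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)                             # length-normalized score

q = np.random.rand(20, 8)    # query word descriptor sequence (20 columns, 8-D)
c = np.random.rand(35, 8)    # candidate text-line descriptor sequence
print(dtw_distance(q, c))    # lower cost = better match
```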
Considering the second category of the methods, the ResNet-152 and the Pyramidal Histogram of
Characters (PHOC) embedding have been combined to build a script-independent multilingual word-
spotting model for Latin, Arabic, and Bangla (Indian) scripts. The proposed deep CNN (DCNN) has been
trained to deal with multilingual word-spotting as multitasking, similar to detecting text in the wild by a
human being. The results obtained from the system indicated that only one deep learning model can
be used to design a script-independent multilingual word-spotting system comparable with the system
using a single model per script. The system is also able to recognize handwritten words in scene images
(Al-Rawi et al., 2019).
Among the methods categorized in the second group, Jaderberg et al. (2014) presented a method composed of two sequential tasks of detecting word regions and recognizing the words within these regions
for word spotting in natural images. These components have further been used together to form an end-
to-end text spotting system for images. A Convolutional Neural Network (CNN) classifier has been de-
signed to handle both tasks. Many layers of the proposed CNN architecture have also been used as
features for text detection, character recognition, and bigram classification (Jaderberg et al., 2014). The
results obtained from the system indicated the significance of jointly learning features to build multiple
strong classifiers (Jaderberg et al., 2014). Jaderberg et al. (2016) have used the same pipeline to first
extract region proposals for text detection. Proposals have then been filtered using a random forest
classifier to reduce the number of false-positive detections. Deep CNNs have been designed to refine
proposals based on bounding box regression and perform word recognition on each refined region
proposal at the same time. Detection and recognition results have been merged and assigned a score
to each text proposal so that thresholding can be performed on the detection results to obtain the final text spotting results (Jaderberg et al., 2016). This pipeline ensured high recall, and the fast subsequent filtering stage improved precision. The CNNs have been trained solely on data
produced by a synthetic text generation engine, requiring no human-labeled data (Jaderberg et al.,
2016). This system is fast and scalable, as datasets of millions of images can be used for instant text-based image retrieval without any perceivable degradation in accuracy. Additionally, the recognition model has been trained purely on synthetic data, which allows the system to be easily re-trained for the recognition of other languages or scripts without the need for any human-labeled data (Jaderberg et
al., 2016). Augmented multi-resolution maximally stable extremal regions and CNNs have further been
used for text spotting from scene images. Moreover, text character proposals have been augmented to maximize text detection rates while using relatively shallow architectures and a small amount of training data.
Simple and fast geometric transformations on multi-resolution proposals have finally been used as de-
scriptors to detect text characters (Zamberletti et al., 2015).
Unlike the methods that deal with the problem of text spotting considering the text detection and text
recognition separately, recent deep learning based methods try to integrate the detection and recogni-
tion stages with an end-to-end trainable neural network to get the advantages of the complementarity
of text detection and recognition in a single framework. The method presented in (Hui et al., 2017) is
among the first attempts that used such a concept. In (Hui et al., 2017), a unified framework based on
a text proposal network, Recurrent-CNN (R-CNN), and Long Short-Term Memory (LSTM) has been proposed to simultaneously localize and recognize text with a single forward pass, avoiding image
cropping, feature re-computation, word separation, and character grouping. The framework has been
trained end-to-end, using images, ground-truth bounding boxes and text labels to obtain convolutional
features and use them for both detection and recognition purposes. This multi-task training saves pro-
cessing time and the learned features become more informative, improving overall performance (Hui et
al., 2017).
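The idea of sharing convolutional features between detection and recognition heads can be illustrated with the following toy PyTorch sketch; the layer sizes and heads are placeholders and do not reproduce the actual architecture of (Hui et al., 2017).

```python
import torch
import torch.nn as nn

class SharedTextSpotter(nn.Module):
    """Toy network sharing one convolutional trunk between a text/non-text
    detection head and a per-pixel character classification head."""

    def __init__(self, num_classes=37):   # e.g. 26 letters + 10 digits + background
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.det_head = nn.Conv2d(64, 1, 1)            # per-pixel text score
        self.rec_head = nn.Conv2d(64, num_classes, 1)  # per-pixel character logits

    def forward(self, images):
        feats = self.backbone(images)                  # features computed once
        return self.det_head(feats), self.rec_head(feats)

# Training would combine a detection loss and a recognition loss over the
# shared features, so both tasks shape the same representation.
```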
In a recent work, Bazazian et al. (2018a) have designed a fully convolutional network to generate character attribute heatmaps for all characters. A rectangle classifier has been used to fuse text proposals and heatmaps to detect the most likely rectangle for the query word in scene images. The method can handle the problem of unconstrained word spotting in scene images (Bazazian et al., 2018a). Liu et al. (2018) have
also performed text spotting on the oriented text in an end-to-end fashion applying text detection and
recognition simultaneously using a Fast Oriented Text Spotting (FOTS) network. The method has been
built using CNN, which learns and shares features for text detection and recognition. The joint training
method has provided better performance compared to two-stage methods (Liu et al., 2018).
An end-to-end trainable framework called Word Segmentation Guided Characters Aggregation Net
(WAC-Net) has further been developed to spot arbitrarily shaped text of different scripts in scene images (Gao et al., 2019). A shared convolutional backbone, a word-level instance-aware segmentation network (WSN), and a character-level detection and recognition network (CDRN) work together to spot text in a single forward pass. The WSN and CDRN are jointly trained by multi-task learning (Gao et al., 2019). Moreover, a trainable neural network called Mask TextSpotter has been presented
to achieve both detection and recognition of multi-script text instances of irregular shapes directly from two-dimensional space via semantic segmentation. In addition, a spatial attention module has been used to enhance the performance and generality of the end-to-end text spotting approach (Liao et al., 2019). An end-to-end trainable network based on instance segmentation has also been proposed to
simultaneously detect and recognize the text of arbitrary shapes in scene images. An attention model
has further been considered to decode the textual content of each arbitrary shape text region. A simple
RoI masking has finally been employed to extract features from arbitrary shape text regions.
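The RoI masking step just mentioned can be illustrated by the small sketch below, in which features are pooled only over the pixels of an arbitrarily shaped text mask; this is a simplified illustration rather than the exact operation used in the cited work.

```python
import numpy as np

def roi_mask_pool(feature_map, region_mask):
    """Average-pool convolutional features over an arbitrarily shaped text region.

    feature_map: (C, H, W) array of convolutional features
    region_mask: (H, W) binary mask of the detected text region
    Returns a (C,) descriptor that can be fed to the recognition branch.
    """
    mask = region_mask.astype(bool)
    if not mask.any():                        # empty region: return a zero descriptor
        return np.zeros(feature_map.shape[0])
    masked = feature_map[:, mask]             # keep only features inside the region
    return masked.mean(axis=1)
```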
To avoid feature refinement between the detector and the recognizer, and to directly feed features extracted from the detected text instances to the decoder, the outputs of an existing OCR engine have been used as weakly labeled data to train the recognition model, improving both the detection and recognition accuracies (Qin et al., 2019). Song et al. (2019) have further proposed a combination of convolu-
tional and recurrent neural networks by sharing a convolutional feature map to address scene text de-
tection and recognition at the same time. The text has been detected and recognized in a single forward propagation to eliminate redundant processes, such as image patch cropping and repeatedly computing feature maps. The unified neural network has been trained using images, ground-truth bounding boxes, and text labels, and promising performance in terms of computation time and accuracy has
been achieved without applying complicated post-processing steps (Song et al., 2019). Zhou et al.
(2019) have also presented another end-to-end deep neural network model called Multi-Language Scene
Text Spotter (MLTS) for multi-language scene text detection, recognition and script identification. A
special backbone for text features and two different types of attention have been considered to achieve
state-of-the-art performance for both text spotting and script identification in natural images (Zhou et
al., 2019). Recently, Bagi et al. (2020a) have proposed an end-to-end trainable deep neural network
based on local, global and contextual information of multi-scale feature maps of a lightweight backbone
network for spotting text instances in scene images with background clutter, partially occluded text,
truncation artifacts, and perspective distortions. The problem of inter-class misclassification has been
addressed by maximizing inter-class separability and compacting intra-class variability using Gaussian
softmax. Multi-language character segmentation and word-level recognition have also been incorpo-
rated into the system. The proposed text spotting method provided high accuracies for detecting multi-
lingual text, logos, and symbols in scene images with the cluttered background environment captured
from resource-constrained devices, such as smartphones (Bagi et al., 2020). Furthermore, Liu et al.
(2020a)have introduced an end-to-end trainable unified framework for arbitrary shape text spotting by
integrating holistic-, pixel- and sequence level semantic information into the system. The Mask R-CNN
has been customized to obtain both holistic- and pixel-level semantics for text recognition. The two-
dimensional feature maps extracted from the text spotting task have been fed into an additional text
recognition branch. One-dimensional sequence-level semantics extracted based on an attention-based
sequence-to-sequence network has also been used for text recognition. Finally, the results obtained
from all three levels of semantics have been combined to achieve high accuracies in text recognition
and spotting. Besides, the wide descriptions of texts obtained from the framework enabled the system
to use only word-level weakly annotated data for training a model for robust text spotting (Liu et al.,
2020a). A bottom-up approach for text spotting in scene images was also developed by Fan et al.
(2020). A character detector based on an Extremal Region (ER) detector and an Aggregate Channel
Feature (ACF) detector has been proposed to first detect character candidates with high recall rates.
The real character proposals have then been determined using a CNN filter for high character detection
precision. A hierarchical clustering algorithm, which combines multiple visual and geometrical features,
has finally been designed to group characters into word proposals for word recognition (Fan et al.,
2020). A bottom-up approach for keyword spotting and context extraction in multi-oriented Chinese scene images has further been presented. The proposed approach includes character detection, keyword spotting, and context extraction, in which character-level text detection and recognition are performed simultaneously using an SSD network. Furthermore, the geometric relationship between keywords and their context has been analyzed to spot the keywords. Finally, the context extractor has filtered out wrong keywords and produced the context of the keywords according to their geometric locations (Wu et al., 2018).
An Adaptive Bezier-Curve Network (ABCNet) has further been proposed, for the first time, to fit oriented
or curved text by a parameterized Bezier curve. A BezierAlign layer has also been designed for extract-
ing accurate convolution features of text instances that significantly improved the precision measure.
Compared with standard bounding box detection, the Bezier curve detection has a negligible computation overhead. Moreover, the method can handle text spotting and recognition efficiently and accurately compared to state-of-the-art methods, and it is also 10 times faster than recent state-of-the-art methods (Liu et al., 2020b).
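To illustrate why a parameterized Bezier curve is a compact way to describe curved text boundaries, the sketch below samples points on a cubic Bezier curve from its four control points; this is a generic Bezier evaluation, not the ABCNet implementation.

```python
import numpy as np

def cubic_bezier(control_points, num_samples=20):
    """Sample points along a cubic Bezier curve defined by 4 control points.

    control_points: (4, 2) array of (x, y) control points for one text boundary
    Returns a (num_samples, 2) array of points tracing the curved boundary.
    """
    p0, p1, p2, p3 = np.asarray(control_points, dtype=float)
    t = np.linspace(0.0, 1.0, num_samples)[:, None]
    return ((1 - t) ** 3 * p0
            + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2
            + t ** 3 * p3)

# A curved text instance can be described by two such curves (top and bottom
# boundaries); sampling both yields the grid used to rectify the text region.
```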
To efficiently handle text spotting in blurry scene images, Bagi and Dutta (2020b) have proposed a text spotter called Blurred TextSpotter. An encoder-decoder backbone network based on multi-scale contextual information, followed by spatial and channel-wise attention, has been considered in the Blurred TextSpotter. Text masks have been accurately detected and classified using a hardware-efficient recognition module (Bagi & Dutta, 2020b). Different datasets have been used to evaluate the word spotting methods introduced in the literature.
Visual context information has further been used by Sabir et al. (2020) to train/tune and evaluate existing semantic similarity-based text spotting baselines for re-ranking the produced text hypotheses, resulting in improved text spotting accuracy. A visual context dataset has been introduced for text spotting in the wild by adding information, such as a textual image description (caption) and the names and locations of objects in the image, to the scene images of the publicly available COCO-Text dataset. This enables researchers to use semantic relations between texts and scenes in their text
spotting systems (Sabir et al., 2020).
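The general idea of re-ranking recognition hypotheses with visual context can be sketched as follows; the embeddings, scores, and weighting scheme are hypothetical placeholders rather than the formulation used by Sabir et al. (2020).

```python
import numpy as np

def rerank_hypotheses(hypotheses, scene_object_vecs, alpha=0.5):
    """Re-rank text hypotheses using visual context.

    hypotheses: list of (word, recognition_score, word_vec) tuples, where
                word_vec is a semantic embedding of the candidate word
    scene_object_vecs: (k, d) embeddings of objects/caption words in the scene
    alpha: weight balancing recognition confidence and context similarity
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    rescored = []
    for word, rec_score, vec in hypotheses:
        context = max(cos(vec, obj) for obj in scene_object_vecs)
        rescored.append((alpha * rec_score + (1 - alpha) * context, word))
    return sorted(rescored, reverse=True)   # best combined score first
```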
As text-line based text spotting methods are unable to handle arbitrary Chinese text (text-lines) in scene
images, a character-based framework composed of three modules, including character detection, char-
acter recognition, and character grouping, has been proposed in the literature to spot Chinese text in scene images (Song et al., 2019). A Conditional Random Field (CRF) based character grouping algorithm has been used to arrange arbitrary Chinese text. The proposed framework achieved superior performance compared with state-of-the-art text-line based methods when applied to the ReCTS-ARB549 dataset (Song
et al., 2019).
Recently, a pipeline of text spotting composed of text detection and recognition has been proposed to
perform text spotting in natural scene images containing complicated backgrounds, various fonts,
shapes, and orientations (Wang et al., 2020). The text detection component, called UNet, Heatmap, and Textfill (UHT), used a UNet to compute heatmaps for candidate text regions and a Textfill algorithm to produce a polygon-based bounding box for each word in the candidate text region. The UNet has
been trained with ground-truth heatmaps. The proposed text spotting framework, called UHTA, has
been designed by concatenating the UHT and a state-of-the-art text recognition system. The system
has been applied on four public scene-text-detection datasets, including Total-Text, SCUT-CTW1500,
MSRA-TD500, and COCO-Text, and the results indicated the effectiveness of the UHT in detecting
multilingual and curved text in scene images (Wang et al., 2020). The proposed method is, however,
complex and needs to tune many parameters. As can be seen from the literature, different datasets
have been used to evaluate the above-mentioned methods. To get an idea of the other datasets used
for evaluation in the literature of this domain, a list of datasets is presented in Table 7. As shown in
Table 7, most of them are various versions of the ICDAR datasets used for the evaluation of text spotting
methods in several ICDAR competitions.
Table 7. Datasets in the literature for text recognition and spotting (Liu et al., 2020b)

Name        | Description                                         | Dictionary | Bounding boxes | Text instances
ICDAR17-T2  | ICDAR 2017 Task 2: from the COCO-Text dataset       | -          | 46K            | -
ICDAR13     | ICDAR 2013                                          | -          | 1K             | -
SVT         | Street View Text                                    | -          | 647            | -
Synth90K    | A synthetic dataset with a dictionary of 90K words  | 90K        | 9M             | -
ICDAR17-V   | Image+Textual dataset from ICDAR17 Task-3           | -          | 10K            | 25K
COCO-Text-V | Image+Textual dataset from COCO-Text                | -          | 16K            | 60K
COCO-Paris  | Only Textual dataset from COCO-Text                 | -          | -              | 158K
5.1.3 Summary
From the text spotting methods discussed in the above sections, it is noted that these methods have
already addressed different challenges of word spotting in natural scene images. However, these meth-
ods are mostly limited to scene images and most of them may not be suitable for videos. In the case of
videos, the methods proposed for natural scene images are, firstly, not capable of using temporal information to improve the performance of text spotting. Secondly, they become expensive as they have to deal with many temporal frames to perform text spotting. Therefore, specific
methods should be developed for word spotting in videos by exploring temporal information. The fol-
lowing section, therefore, focuses on feature extraction and learning based methods for word spotting
in the video frames.
5.2 Methods for Word Spotting in Video Images
As the volume of generated videos increases every day, the automatic retrieval of videos based on their content is necessary to reduce the time spent on manual indexing of such a huge number of videos. Text
or word spotting in videos, including lecture videos, has received much less attention in the literature compared to
text spotting in scene images (Dutta et al., 2018; Jha et al., 2018). There are various types of texts in
videos, such as running text, and arbitrary text. Running texts in broadcast videos generally appear
horizontally at fixed positions with good contrast and little variation. These characteristics make text
detection in broadcast videos comparatively easier compared to other types of text in videos (Dutta et
al., 2018). However, despite achieving highly accurate recognition rates from text images, not many
text spotting methods for lecture videos have been reported in the literature (Dutta et al., 2018). Lecture
videos are rich in textual information, and understanding this textual information can help with better
video understanding and retrieval. Text detection and recognition in presentation slides have been per-
formed to match the slides with the lecture videos (Wang et al., 2003). Edge detection and geometry
based methods followed by a commercial OCR have been used for text spotting (Wang et al., 2003).
Keyword search and video indexing have also been performed using off-the-shelf OCR systems (Tuna
et al., 2011). Furthermore, the combination of Automatic Speech Recognition and OCR methods has
been employed on lecture videos to extract keywords for video text spotting (Yang et al., 2014).
In addition, a recognition-free pipeline for video retrieval has been proposed to retrieve silent speech
videos containing a queried word in the form of a video clip. The method uses video segmentation to
obtain a set of word proposal clips. A similarity measure and threshold have then been used to decide
if a ‘word proposal clip’ contained the spotted word. A query expansion technique using pseudo-relevance
feedback and a re-ranking method based on correlation maximization has also been proposed in the
system (Jha et al., 2018). To review the state-of-the-art text spotting methods in videos, in line with the
previous sections three categories, feature, learning based approaches and the combination of both
approaches are considered in the next subsections.
5.2.1 Feature Extraction Methods for Word Spotting
Local appearance and global structure information of characters have been considered for character
recognition in video frames. Part-based tree structures have been used to model each category of
characters to detect and recognize characters simultaneously. The HOG features, as the local appear-
ance descriptor, and color channels with the largest gradient magnitude for color images have been
used in the literature for video and scene text recognition (Benabdelaziz et al., 2020). Structure-guided character detection and linguistic knowledge have also been considered in the proposed system (Shi et al., 2014). For word recognition, the detection scores and language model have been combined into the posterior probability of the character sequence from the Bayesian decision view. The final word recognition result has been obtained by maximizing the character sequence posterior probability via the Viterbi algorithm (Shi et al., 2014). PHOG features have been computed from the gray and binarized
images for date spotting in natural scene images and video frames with complex backgrounds, blur
noise, and low resolution. Binary and gray image features have been combined by an MLP-based tandem approach. The proposed date spotting framework has been built using three different HMMs and ap-
plied to segmented text lines from natural scene images and video frames without segmenting charac-
ters or words (Roy et al., 2015; Roy et al., 2018).
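For illustration, a simplified PHOG computation is sketched below: gradient orientation histograms are accumulated over an increasingly fine spatial pyramid and concatenated. The pyramid levels and bin count are assumptions, and the sketch omits the tandem MLP/HMM stages of the cited framework.

```python
import numpy as np

def phog(gray, levels=(1, 2, 4), bins=8):
    """Pyramidal Histogram of Oriented Gradients for a grayscale word image.

    gray: (H, W) array; levels: grid sizes of the spatial pyramid.
    Returns the concatenated, L1-normalised orientation histograms.
    """
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), np.pi)        # orientations in [0, pi)
    H, W = gray.shape
    feats = []
    for cells in levels:
        for i in range(cells):
            for j in range(cells):
                ys = slice(i * H // cells, (i + 1) * H // cells)
                xs = slice(j * W // cells, (j + 1) * W // cells)
                hist, _ = np.histogram(ang[ys, xs], bins=bins,
                                       range=(0, np.pi), weights=mag[ys, xs])
                feats.append(hist)
    feats = np.concatenate(feats)
    return feats / (feats.sum() + 1e-8)
```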
Moreover, the combination of Texture-Spatial-Features (TSF) has been used for keyword spotting in
video images of different fonts, contrasts, backgrounds and font sizes without word recognition. The set
of texture features has been extracted to identify text candidates in the segmented word image using
the K-means clustering technique (Shivakumara et al., 2015). The combination of Radon and Fourier
coefficients has been considered to define context features based on coefficient distributions of fore-
ground and background of text candidates. Canny edges extracted from the image and minimum cost
path based ring growing have been used to restore missing text components. These features have
been extracted locally and globally for spotting words from videos, natural scene and license plate
images (Shivakumara et al., 2019). However, these feature extraction methods are generally sensitive
to noise. Moreover, they seem to be selected arbitrarily and may not provide high text spotting accuracies
on other datasets.
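The idea of separating text candidates from the background by clustering simple texture responses can be illustrated with the following sketch; the local standard-deviation feature and the two-cluster K-means are illustrative simplifications of the texture-spatial features used in (Shivakumara et al., 2015).

```python
import numpy as np
from sklearn.cluster import KMeans

def text_candidate_mask(gray, window=5):
    """Split a word image into text-candidate and background pixels by
    clustering a per-pixel texture feature (local standard deviation)."""
    H, W = gray.shape
    pad = window // 2
    padded = np.pad(gray.astype(float), pad, mode='reflect')
    feats = np.zeros((H, W))
    for i in range(H):                    # slow but clear: one window per pixel
        for j in range(W):
            feats[i, j] = padded[i:i + window, j:j + window].std()
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(feats.reshape(-1, 1))
    labels = labels.reshape(H, W)
    # assume the cluster with the higher mean texture response corresponds to text
    text_cluster = int(feats[labels == 1].mean() > feats[labels == 0].mean())
    return labels == text_cluster
```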
5.2.2 Learning based Approaches for Word Spotting
To promote research on text spotting in video lectures, a new dataset, called LectureVideoDB, com-
posed of frames from 24 different course videos, including science, management and engineering, has
been introduced in the literature (Dutta et al., 2018). The quality and resolution of videos, camera angle
and its distance from the blackboard vary in the collected videos, but the text remains the focus of the cameras.
Experimental results obtained from the existing methods in the literature indicated that these methods need to be improved for accurate text spotting in lecture videos (Dutta et al., 2018). To detect
handwritten text, math expressions and sketches in lecture videos, a deep learning based method was
applied by Kota et al. (2018). The proposed system can generate a summary of the content presented
over time in the lecture while addressing the problem of content occlusion. By employing the proposed
system, timestamp-based semantically meaningful bounding box annotations can be provided for the
handwritten whiteboard content in the AccessMath dataset (Kota et al., 2018). A method based on deep
and transfer learning has also been presented for handwritten word retrieval in the literature (Benabdelaziz et al., 2020). The visual features extracted from both deep and transfer learning methods have
been considered for retrieval experiments on the ICDAR15 word spotting dataset. Six different CNN
architectures and three distance metrics have been used for the experiments. Despite the complexity
of handwritten word spotting, deep CNNs have been tuned using transfer learning to provide efficient
word-spotting (Benabdelaziz et al., 2020). In addition, Mafla et al. (2020) have proposed a single shot
CNN architecture for scene text retrieval to obtain word bounding boxes as a compact representation
of spotted words. The problem has been modeled as a nearest neighbor search of the textual repre-
sentation of the input query over the outputs of the CNN obtained from an image database. The proposed method is fast and suitable for multilingual and real-time text spotting in videos (Mafla et al., 2020).
5.2.3 Combination of Feature and Learning based Approaches for Word Spotting
Compared to feature and learning based methods, the hybrid or combination approach for text spotting
from videos has received less attention from researchers. Recently, Mafla et al. (2020) have proposed
a real-time word spotting method based on a fully convolutional neural network to detect and recognize
text in a single pass. The PHOC descriptor has been used to universally encode the presence of a
specific character in a visual region of the proposed bounding box of a language-specific text string
using a CNN based model (YOLOv2 object detection network). The single-shot detection model has
been trained to construct the PHOC by automatically learning character attributes independently and
transferring knowledge acquired at the training phase to build PHOCs of unseen words at inference
time. The proposed PHOC version is a binary vector of size 820 dimensions constructed by concate-
nating the L2 to the L6 unigram levels along with 2 levels of the 50 most common English language
bigrams. As the proposed network uses a smaller filter size in the model’s last filter, it can perform in
real-time (Mafla et al., 2020). Moreover, using a bigger PHOC along with more unigram and bigram
levels can provide superior scene text retrieval results compared with the state of the art results on
different datasets, including multilingual datasets (Mafla et al., 2020). However, this method is lan-
guage-specific and may not be easily applicable to other scripts.
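A simplified construction of such a PHOC vector is sketched below; FIFTY_COMMON_BIGRAMS is a hypothetical list of the 50 most frequent English bigrams, and the exact region-assignment rule may differ from the cited implementation. With unigram levels 2 to 6 over 36 symbols this yields (2+3+4+5+6)x36 = 720 bins, and a level-2 split over 50 bigrams adds 2x50 = 100 bins, giving the 820 dimensions mentioned above.

```python
import numpy as np

UNIGRAMS = "abcdefghijklmnopqrstuvwxyz0123456789"          # 36 symbols

def phoc(word, unigram_levels=(2, 3, 4, 5, 6), bigrams=(), bigram_levels=(2,)):
    """Build a PHOC descriptor for a word string.

    Each pyramid level splits the word into `level` equal parts and builds a
    binary occurrence histogram over the vocabulary for each part.
    """
    word = word.lower()
    n = max(len(word), 1)

    def region_hists(symbols, positions, level, vocab):
        out = []
        for part in range(level):
            lo, hi = part / level, (part + 1) / level
            bins = np.zeros(len(vocab))
            for sym, centre in zip(symbols, positions):
                if lo <= centre < hi and sym in vocab:
                    bins[vocab.index(sym)] = 1.0
            out.append(bins)
        return out

    uni_pos = [(i + 0.5) / n for i in range(len(word))]      # character centres
    bi_syms = [word[i:i + 2] for i in range(len(word) - 1)]  # consecutive bigrams
    bi_pos = [(i + 1.0) / n for i in range(len(word) - 1)]

    vec = []
    for level in unigram_levels:
        vec += region_hists(list(word), uni_pos, level, UNIGRAMS)
    for level in bigram_levels:
        vec += region_hists(bi_syms, bi_pos, level, list(bigrams))
    return np.concatenate(vec)

# phoc("text", bigrams=FIFTY_COMMON_BIGRAMS).shape  ->  (820,)  (hypothetical list)
```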
5.2.4 Summary
From the literature on text spotting in videos, it can be noted that more attention has been put towards
feature and learning based approaches compared to the hybrid or the combination approach. Interest-
ingly, the feature extraction based methods do not require a large number of samples for training and
have fewer tuning parameters compared to the learning based approaches. However, their ability to adapt to different situations and applications comes at the cost of accuracy in contrast to learning based methods. There are hybrid methods that consider the combination of handcrafted features and learning to
achieve better results than individual approaches. However, the main issue with this type of approach
is developing appropriate and effective techniques for feature extraction and learning stages. At the
same time, the performance of the hybrid methods depends on the success of each stage.
5.3 Limitations of the Spotting based Text Mining Methods
We tried to provide a critique of each method during the overview of the literature. Considering the
research work in the literature of text spotting in images and videos, several more limitations can also
be pointed out as follows. Generally, conventional text spotting methods in images and videos highly
depend on binarization, connected component analysis, and segmentation tasks. These tasks are very
sensitive to image/video contrast, complex background, and resolution. Moreover, they are sensitive to
font types and sizes, noise, distortion and degradations. Therefore, these tasks and their pipeline as a
whole, which is called the conventional model, may not provide good keyword spotting in natural scene images and video frames (Shivakumara et al., 2019). Some methods in the literature used a recognition
module for text spotting. Though the recognition module can improve text spotting results, these meth-
ods are highly dependent on training data; in particular, they require large and unconstrained datasets
for training and verification. In addition, some of the methods in the literature are dependent on word
lexicons with fixed vocabulary, which results in the limitation of the methods for unconstrained text
spotting (Mafla et al., 2020). Moreover, feature extraction methods, especially when dealing with a huge
number of video frames, are computationally expensive. This may affect their suitability for real-time
applications that need to be both efficient and accurate. It is also worth mentioning that most of the
descriptors are sensitive to contrast, background variations and degradations which make the feature-
based text spotting method likely unreliable in images and video frames. The general problem with
deep-learning methods is the need for a large number of training samples that may result in the loss of
their generic property (Shivakumara et al., 2019; Mafla et al., 2020). The use of pre-trained models may be suitable at different stages of the text spotting problem.
6. FUTURE DIRECTIONS
As future work in the text spotting research domain, the applications of rule mining can be investigated
in three different contexts, namely document classification and retrieval, automated layout correction,
and automated generation of documents. The application of spatial data mining techniques can also be
investigated to find associations between logical components in document images employing document
analysis and understanding methods. As data annotation in document analysis applications is time-
consuming and expensive and, at the same time, machine learning based approaches for text spotting require a huge amount of data for training and fine-tuning the parameters of the model, another direction
of research in this domain is using semi-supervised methods trained with weakly annotated data.
It is noted from the review on text spotting in natural scene and video images that none of the methods
can handle all the challenges of keyword spotting in natural scene and video images. Most methods
focus either on natural scene images or video images but not both. This is because the nature of scene
images and videos differs in terms of characteristics and complexity. In addition, in the case of video-
based methods, there is no proper criterion to automatically determine the number of temporal frames
according to the complexity of the problem. Most methods aim to find a solution to the problems but a
few methods have addressed the issue of system and prototype design for real-time applications. When
we look at target applications of text detection and spotting in the natural scene and video images,
retrieval and indexing is the main application. There are forensic and healthcare applications, such as
tracing a bomber or an unwell person with the help of bib numbers in a marathon, or person re-identification through text spotting on jerseys in sports, which can also be used for person behavior identification.
Text spotting in these applications is challenging because of the short length of text, occlusion and
movements.
Furthermore, the scope of the existing mining approaches is confined to 2D natural scene, video and
document images but not 3D images. However, due to the availability of 3D cameras, scanners and
future 3D TVs, one can expect 3D images, 3D TV, and 3D movies. In this case, the existing methods may not
be effective or applicable to these types of data. The main reason is that depth data/information creates
shadows and allows decorative characters to be written in the text. The presence of shadows and irregular decorative-shaped characters affects the actual shape of the characters in the text. Therefore, the performance of the existing methods declines in such cases. To overcome this problem, one way is to
classify 2D and 3D text images so that existing 2D methods can be used for 2D images and modified for 3D images.
Another way is to detect shadow and depth information in the 3D images and remove the depth. This
is possible because the pixels representing shadows have lower values compared to the pixels representing text. This results in 2D text images, and hence existing 2D methods can be applied for text
spotting. One more way is to develop a new method that can work for both 2D and 3D images without
classifying the images or without shadow removal. As text is common in both 2D and 3D, it is possible
to define context based on recognition results and natural language processing that can help to find
text in both types of images.
Nowadays, many closed-circuit television (CCTV) cameras are installed in cities, houses, hotels, malls, roads, and streets to identify crimes and to use the data as cues and evidence. When the same spot is captured
by multiple CCTV cameras, the same text in the views appears in different forms due to variation in
distance, angle, height from the ground and configuration. In addition, each view may suffer from dif-
ferent adverse effects, such as low resolution, contrast, missing information, and perspective distortion.
As a result, the existing methods may not work well. This is a new direction of research in this domain
to investigate and propose new ideas that can use information in different views to predict the correct
text information.
Another new trend and research topic is text spotting in underwater images and videos. The complexity
of the problem depends on water depth and water clarity. As clarity decreases, the complexity of spotting text increases. Spotting text in underwater images is not easy because of the poor quality,
degradation and occlusion of the text. Therefore, the existing methods may not work well for underwater
images. In this case, since the properties of text and water information are different, the method should
explore these cues to enhance the fine details in the text. Then we can use text detection methods for
extraction. Another new application related to forensic, crime and person behavior identification is tattoo
text spotting in the images. Spotting text in those images can help us to study person psychology,
behavior, personality traits, person identification, and gang identification. However, detecting tattoo text
is challenging compared to text in natural scene, video and document images. Tattoo text is handwritten text with a decorative style, embedded on the skin of different parts of the human body. To find a
solution to this problem, one can think of detecting skin to reduce the complexity of the problem. There
are many methods available for skin detection. Skin detection results can be considered as context to
detect tattoo text in the images. To fix the exact bounding box for each tattoo text line, we need to
explore natural language processing and recognition results because tattoo text lines are connected to
the decorative background and other tattoo text lines. It is also possible to integrate text, image, video
and audio information for text mining from sports-related datasets. This can be considered as another
direction for mining text from sports datasets to understand and analyse different sports or games.
7. CONCLUSIONS
In this research work, we have provided a comprehensive review of the recent advances in the literature
of mining text from the natural scene, video and document images. We identified the objective, signifi-
cance, and scope of different methods. Furthermore, we presented datasets, evaluation schemes, and
measures used in the literature of text spotting in images and videos. With the analysis of methods in
the literature, it has been learnt that for mining meaningful information from the video, scene and doc-
ument images, two typical methods known as non-spotting text detection models and keyword spotting
models can be employed on images and videos. It has further been found that most data mining meth-
ods focus on the content of the images but not the text in the images/videos. Considering technological
aspects and the use of different attributes and components in each method, the survey has revealed
that the models in the literature can be categorized and further analyzed based on the types of their
features, and learning strategies. This categorization has highlighted the usefulness, effectiveness and
limitations of each category and model with respect to different applications. The analysis of the meth-
ods in each category has further revealed that feature-based models for both non-spotting and spotting
are good in terms of flexibility, adaptability, and generalization concerning different situations and ap-
plications. In the case of learning based models, we noted that the success of these methods highly
depends on fine-tuning various parameters and the number of training samples. In contrast, the survey
revealed that since hybrid models consider the advantages of both feature engineering and learning
based models, the hybrid models perform better than feature and learning based models in complex
situations.
The survey has further revealed that there are several potential applications in the field of text mining,
namely, text mining in 3D videos, sports event mining based on multiple views captured by different
CCTV cameras, and person re-identification from the data captured by multiple cameras. The new ap-
plications pose several challenges and open problems for researchers in the field of text mining. Con-
sidering the limitations observed in current text detection methods on video and natural scene images
as well as new applications and their associated challenges, researchers can find several research
opportunities to investigate and explore new text mining models and solutions for those open challeng-
ing problems.
References
Alaei, A., Conte, D., & Raveaux, R. (2015). Document image quality assessment based on improved
gradient magnitude similarity deviation. In 2016 12th IAPR workshop on document analysis sys-
tems (DAS), pp. 176-180.
Alaei, A., Raveaux, R., & Conte, D. (2017). Image quality assessment based on regions of interest.
SIViP, 11, 673680.
Almazan, J., Gordo, A., Fornés, A., & Valveny, E. (2014). Word spotting and recognition with embed-
ded attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36, 25522566.
Al-Rawi, M., Valveny, E., & Karatzas, D. (2019). Can one deep learning model learn script-independent
multilingual word-spotting? In 2019 international conference on document analysis and recognition
(ICDAR), pp. 260267.
Arafat, S. Y., & Iqbal, M. J. (2020). Urdu-text detection and recognition in natural scene images in deep
learning. In Proceedings on ELMAR, pp. 96787-96803.
Bagi, R., Dutta, T., & Gupta, H. P. (2020a). Cluttered TextSpotter: An end-to-end trainable light-weight
scene text spotter for cluttered environment. IEEE Access, 8, 111433111447.
Bagi, R., & Dutta, T. (2020b). Cost-effective and smart text sensing & spotting in blurry scene images
using deep networks. IEEE Sensors Journal, 11. https://doi.org/10.1109/JSEN.2020.3024257
Bazazian, B., Karatzas, D., & Bagdanov, A. D. (2018a). Word spotting in scene images based on char-
acter recognition. In 2018 IEEE/CVF conference on computer vision and pattern recognition work-
shops (CVPRW), pp. 195319532.
Bazazian, D., Karatzas, D., & Bagdanov, A. D. (2018b). Soft-PHOC descriptor for end-to-end word
spotting in egocentric scene images. In The third international workshop on egocentric perception,
interaction and computing (EPIC) at ECCV2018, pp. 1-9.
Benabdelaziz, R., Gaceb, D., & Haddad, M. (2020). Word-spotting approach using transfer deep learn-
ing of a CNN network. In 2020 1st international conference on communications, control systems
and signal processing (CCSSP), pp. 219-224.
Bonechi, S., Bianchini, M., Scarselli, F., & Andreini, P. (2020). Weak supervision for generating pixel-
level annotations in scene text segmentation. Pattern Recognition Letters, 138, 17.
Brisinello, M., Grabic, R., Vranjes, M., & Vranjes, D. (2019). Review on text detection methods on scene images. In 2019 international symposium ELMAR.
Cai, Y., Wang, W., Chen, Y., & Ye, Q. (2020). IOS-Net: An inside-to-outside supervision network for
scale robust text detection in the wild. Pattern Recognition, 103, 107304.
Chandio, A. A., & Pickering, M. (2019). Convolutional feature fusion for multi-language text detection in
natural scene images. In 2019 2nd international conference on computing, mathematics and engi-
neering technologies (iCoMET).
Cheikhrouhou, A., Kessentitini, Y., & Kanoun, S. (2021). Multi-task learning for simultaneous script
identification and keyword spotting in document images. Pattern Recognition, 113.
Dadiya, N. J., & Goswami, M. M. (2019). Multiscript text detection from images: A survey. In 2019
innovations in power and advanced computing technologies (i-PACT), pp. 1-5.
Dai, P., Zhang, H., & Cao, X. (2020). Deep multi-scale context aware feature aggregation for curved
scene text detection. IEEE Transactions on Multimedia, 99, 19691984.
Dutta, K., Mathew, M., Krishnan, P., & Jawahar, C. V. (2018). Localizing and recognizing text in lecture
videos. In 2018 16th international conference on frontiers in handwriting recognition (ICFHR), pp.
235240.
Fan, J., Chen, T., & Zhou, F. (2020). BURSTS: A bottom-up approach for robust spotting of texts in
scenes. Journal of Visual Communication and Image Representation, 71.
Fassold, H., & Ghermi, R. (2019). OmniTrack: Real time detection and tracking of objects, text and
logos in video. In Proceedings on ISM, pp. 245-246.
Francis, L. M., & Sreenath, N. (2020). TEDLESS: Text detection using least-square SVM for natural
scene. Journal of King Saud University-Computer and information sciences, 32(3), 87299.
Gao, Y., Huang, Z., Dai, Y., Chen, K., Guo, J., & Qiu, W. (2019). Wacnet: Word segmentation guided
characters aggregation net for scene text spotting with arbitrary shapes. In Proceedings on ICIP,
pp. 33823386.
Gomez, L., & Karatzas, D. (2017). TextProposals: A text-specific selective search algorithm for word
spotting in the wild. Pattern Recognition, 70, 6074.
Guo, J., You, R., & Hung, L. (2020). Mixed vertical and horizontal text traffic sign detection and recog-
nition for street level scene. IEEE Access, 8, 6941369425.
Gupta, A., Vedaldi, A., & Zisserman, A. (2016). Synthetic data for text localization in natural images. In
Proceedings on ECCV, pp. 23152324.
Huang, R., & Xu, B. (2019). Text attention and focal negative loss for scene text detection. In Proceed-
ings on IJCNN, pp. 18.
Hui, L., Peng, W., & Shen, C. (2017). Towards end-to-end text spotting with convolutional recurrent
neural networks. In Proceedings on ICCV, pp. 52485256.
Jaderberg, M., Vedaldi, A., & Zisserman, A. (2014). Deep features for text spotting. In D. Fleet, T.
Pajdla, B. Schiele, & T. Tuytelaars (Eds.), Computer vision ECCV 2014 (Vol. 8692, p. 2014).
LNCS, Springer.
Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2016). Reading text in the wild with convo-
lutional neural networks. International Journal of Computer Vision, 116, 120.
Jha, A., Namboodiri, V. P., & Jawahar, C. V. (2018). Word spotting in silent lip videos. In Proceedings
on WACV, pp. 150159.
Judd, T., Ehinger, K., Durand, F., & Torralba, A. (2009). Learning to predict where humans look. In
Proceedings on ICCV, pp. 21062113.
Jung, H., & Lee, B. G. (2020). Research trends in text mining: Semantic network and main path analysis
of selected journals. Expert Systems with Applications, 162, 113851.
Khalil, A., Jarrath, M., AI-Ayyoub, M., & Jaraweh, Y. (2021). Text detection and script identification in
natural scene images using deep learning. Computers & Electrical Engineering, 91, 107043.
Khan, M. J., Said, N., Khan, A., Rehman, N., & Khurshid, K. (2019). Automated Latin text detection in
document images and natural scene images based on connected component analysis. In Proceed-
ings on iCoMET.
Kota, B. U., Davila, K., Stone, A., Setlur, S., & Govindaraju, V. (2018). Automated detection of hand-
written whiteboard content in lecture videos for summarization. In Proceedings on ICFHR, pp. 19
24.
Kumar, D. & Singh, R. (2019). A comparative analysis of features extraction algorithms and deep learn-
ing techniques for detection from natural images. In Proceedings on ISCON, pp. 483487.
Lee, C. H., & Wang, S. H. (2012). An information fusion approach to integrate image annotation and
text mining methods for geographic knowledge discovery. Expert Systems with Applications,
39(10), 89548967.
Li, Z., Liu, J., Zhang, G., Huang, Y., Zheng, Y., & Zhang, S. (2021). Learning to predict more accurate
text instances for scene text detection. Neurocomputing, 449, 455463.
Liao, M., Lyu, P., He, M., Yao, C., Wu, W., & Bai, X. (2021). Mask TextSpotter: An end-to-end trainable
neural network for spotting text with arbitrary shapes. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 43(2), 532548.
Liu, J., Chen, Z., Du, B., & Tao, D. (2020). ASTS: A unified framework for arbitrary shape text spotting.
IEEE Transactions on Image Processing, 29, 59245936.
Liu, J., Zhong, Q., Yuan, Y., Su, H., & Du, B. (2020). SemiText: Scene text detection with semi-super-
vised learning. Neurocomputing, 407, 343353.
Liu, S., Xian, Y., Li, H., & Yu, Z. (2020). Detection in natural scene images using morphological com-
ponent analysis and Laplacian dictionary. IEEE Journal of Automatic Sinica, 7(1), 214222.
Liu, X., Liang, D., Yan, S., Chen, D., Qiao, Y., & Yan, J. (2018). FOTS: Fast oriented text spotting with
a unified network. In Proceedings on CVPR, pp. 56765685.
Liu, Y., Chen, H., Shen, C., He, T., Jin, L., & Wang, L. (2020). ABCNet: Real-time scene text spotting
with adaptive Bezier-curve network. In Proceedings on CVPR, pp. 98069815.
Liu, Y., Jin, L., & Fang, C. (2020). Arbitrarily shaped scene text detection with a mask tightness text
detector. IEEE Transactions on Image Processing, 29, 29182930.
Ma, X., Sun, L., Zhong, Z., & Huo, Q. (2021). ReLaText: Exploiting visual relationships for arbitrarily
shaped scene text detection with graph convolutional networks. Pattern Recognition, 111, 107684.
Mafla, A., Tito, R., Dey, S., Gomez, L., Rusiñol, M., Valveny, E., & Karatzas, D. (2020). Real-time
lexicon-free scene text retrieval. Pattern Recognition, 110, 107656.
Mokayed, H., Shivakumara, P., Woon, H. H., Kankanhalli, M., Tong, L., & Pal, U. (2021). A new DCT-
OCM method for license plate number detection in drone images. Pattern Recognition Letters, 148,
4553.
Nag, S., Ramachandra, R., Shivakumara, P., Pal, U., Lu, T., & Kankanhalli, M. (2019). CRNN based
Jersey-bib number/text recognition in sports and marathon images. In Proceedings on ICDAR, pp.
11491156.
Nag, S., Shivakumara, P., Pal, U., Lu, T., & Blumenstein, M. (2020). A new unified method for detecting
text from marathon runners and sports players in video. Pattern Recognition, 107, 107476.
Panwar, M. A., Memon, K. A., Abro, A., Zhongliang, D., Khuhro, S. A., & Memon, S. (2020). Signboard
detection and recognition using artificial neural networks. In Proceedings on ICEIEC, pp. 1619.
Pooja, A. and Dhir, R. (2016). Video text extraction and recognition: A survey. In Proceedings on WiSP-
NET, pp. 1366-1373.
Putro, R. A. P., Putri, F. P., & Praseriyowati, M. I. (2019). A combined edge detection analysis and
clustering based approach for real time text detection. In Proceedings on ICNMS, pp. 59-62.
Qin, S., Bissacco, A., Raptis, M., Fujii, Y., & Xiao, Y. (2019). Towards unconstrained end-to-end text
spotting. In Proceedings on ICCV, pp. 47044714.
Qin, X., Zhou, Y., Yangn, D., & Wang, W. (2019). Curved text detection in natural scene images with
semi and weakly supervised learning. In Proceedings on ICDAR, pp. 559564.
Raghunandan, K. S., Shivakumara, P., Roy, S., Kumar, G. H., Pal, U., & Lu, T. (2019). Multi-script
oriented text detection and recognition in video/scene/born digital images. In IEEE transactions on
circuits and systems for video technology, pp. 11451161.
Rasheed, J., Jamil, A., Dogru, H. B., Tilki, S., & Yesiltepe, M. (2019). A deep learning based method
for Turkish text detection from videos. In Proceedings on ELECO, pp. 935939.
Reddy, S., Mathew, M., Gomez, L., Rusinol, M., Kartazas, D. & Jawahar, C. V. (2020). RoadText-1K:
Text detection & recognition dataset for driving videos. In Proceedings on ICRA, pp. 1107411080.
Rong, X., Yi, C., & Tian, Y. (2020). Unambiguous scene text segmentation with referring expression
comprehension. IEEE Transactions on Image Processing, 29, 591601.
Roy, P. P., Bhunia, A. K., & Pal, U. (2018). Date-field retrieval in scene image and video frames using
text enhancement and shape coding. Neurocomputing, 274, 3749.
Roy, P. P., Bhunia, A. K., Bhattacharyya, A., & Pal, U. (2019). Word searching in scene image and
video frame in multi-script scenario using dynamic shape coding. Multimedia Tools and Applica-
tions, 78, 77677801.
Roy, P. P., Das, A., Majhi, D., & Pal, U. (2015). Retrieval of scene image and video frames using date
field spotting. In Proceedings on ACPR, pp. 705-709.
Roy, S., Shivakumara, P., Pal, U., Lu, T., & Kumar, G. H. (2020). Delaunay triangulation based text
detection from multi-view images of natural scene. Pattern Recognition Letters, 129, 92100.
Sabir, A., Moreno-Noguer, F., & Padro, L. (2020). Textual visual semantic dataset for text spotting. In
Proceedings on CVPRW, pp. 23062315.
Saha, S., Chakraborty, N., Kundu, S., Paul, S., Mollah, A. F., Basu, S., & Sarkar, R. (2020). Multi-lingual
scene text detection and languageidentification. Pattern Recognition Letters, 138, 1622.
Sexton, T., Hodkiewicz, M., Brundage, M. P., & Smoker, T. (2018). Benchmarking for keyword extrac-
tion methodologies in maintenance work orders. Annual Conference of the PHM Society, 10(1), 1
10.
Sharma, N., Pal, U., & Blumenstein, M. (2012). Recent advances in video based documents processing:
A review. In Proceedings on DAS, pp. 6368.
Shi, B., Bai, X., & Yao, C. (2017). An end-to-end trainable neural network for image-based sequence
recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 39(11), 22982304.
Shi, C., Wang, C., Xiao, B., Gao, S., & Hu, J. (2014). Scene text recognition using structure-guided
character detection and linguistic knowledge. IEEE Transactions Circuits and Systems for Video
Technology, 24(7), 12351250.
Shivakumara, P., Liang, F., Roy, S., Pal, U., & Lu, T. (2015). New texture-spatial features for keyword
spotting in video images. In Proceedings on ACPR, pp. 391-395.
Shivakumara, P., Roy, S., Jalab, H. A., Ibrahim, R. W., Pal, U., Lu, T., Khare, V., & Wahab, A. B. A.
(2018). Fractional means based method for multi-oriented keyword spotting. Expert Systems with
Applications, 118, 119.
Song, H., Wang, H., Huang, S., Xu, P., Huang, X., & Ju, Q. (2019). Text Siamese network for video
textual keyframe detection. In Proceedings on ICDAR, pp. 442447.
Song, Q., Zhang, R., Zhou, Y., Jiang, Q., Liu, X., Wang, H., & Wang, D. (2019) Reading Chinese scene
text with arbitrary arrangement based on character spotting. In Proceedings on ICDARW, pp. 91
96.
Song, Z., Zhang, H., & Cui, P. (2019). Towards end-to-end scene text spotting by sharing convolutional
feature map. In Proceedings on ICCC, pp. 18141820.
Tarafdar, A., Mandal, R., Pal, S., Pal, U., & Kimura, F. (2010). Shape code based word-image matching
for retrieval of Indian multi-lingual documents. In Proceedings on ICPR, pp. 9891992.
Tarafdar, A., Pal, U., Roy, P. P., Ragot, N., & Ramel, J. Y. (2013). A two-stage approach for word
spotting in graphical documents. In Proceedings on ICDAR, pp. 319323.
Tian, Z., Huang, W., He, T., He, P., & Qiao, Y. (2016). Detecting text in natural image with connectionist
text proposal network. In Proceedings on ECCV, pp. 5672.
Tuna, T., Subhlok, J., & Shah, S. (2011). Indexing and keyword search to ease navigation in lecture
videos. In Proceedings on AIPR.
Tursun, O., Denman, S., Zeng, R., Sivapaplan, S., Sridharan, S., & Fookes, C. (2021). MTRNet++: One
stage mask based scene text eraser. Computer Vision and Image Understanding, 201, 103066.
Wang, F., Ngo, C. W., & Pong, T. C. (2003). Synchronization of lecture videos and electronic slides by
video text analysis. In Proceedings on ACM MM.
Wang, Q., Zheng, Y., & Betke, M. (2020). A method for detecting text of arbitrary shapes in natural
scenes that improves text spotting. In Proceedings on CVPRW, pp. 22962305.
Wang, S., Liu, Y., He, Z., Wang, Y., & Tang, Z. (2020). A quadrilateral scene text detector with two-
stage network architecture. Pattern Recognition, 102, 107230.
Wang, Y., Wang, L., Su, F., & Shi, J. (2019). Video text detection with fully convolutional network and
tracking. In Proceedings on ICME, pp. 17381743.
Wu, D., Wang, R., Tian, X., Liang, D., & Cao, X. (2018). The keywords spotting with context for multi-
oriented Chinese scene text. In Proceedings on BigMM, pp. 15.
Xiao, X., Xu, Y., Zhang, C., Li, X., Zhang, B., & Bian, Z. (2019). A new method for pornographic video
detection with the integration of visual and textual information. In Proceedings on IMCEC, pp. 1600
1604.
Xiao, Y., Xue, M., Lu, T., Wu, Y., & Palaiahnakote, S. (2019). A text context aware CNN network for
multi-oriented and multi-language scene text detection. In Proceedings on ICDAR, pp. 695700.
Xu, Y., Wang, Y., Zhou, W., Wang, Y., Yang, Z., & Bai, X. (2019). TextField: Learning and deep direction
field for irregular scene text detection. IEEE Transactions on Image Processing, 28(11), 5566
5578.
Xue, M., Shivakumara, P., Zhang, C., Xiao, Y., Lu, T., Pal, U., & Lopresti, D. (2020). Arbitrarily-oriented
text detection in low light natural scene images. IEEE Transactions on Multimedia.
https://doi.org/10.1109/TMM.2020.3015037
Yan, H., & Xu, X. (2020). End to end video subtitle recognition via a deep residual neural network.
Pattern Recognition Letters, 131, 368375.
Yang, H., & Meinel, C. (2014). Content based lecture video retrieval using speech and video text infor-
mation. IEEE Transactions on Learning Technologies, 7(2), 142154.
Ye, Q., & Doermann, D. (2015). Text detection and recognition in imagery: A survey. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 37, 14801500.
Yin, X. C., Zuo, Z. Y., Tian, S., & Liu, C. L. (2016). Text detection, tracking and recognition in video: A
comprehensive survey. IEEE Transactions on Image Processing, 25(6), 27522773.
Youngjiu, L., Chunang, L., Minyong, S., & Changxing, S. (2019). Video subtitle location and recognition
based on edge features. In Proceedings on DSA, pp. 455459.
Yu, H., Zhang, C., Li, X., Han, J., Ding, E., & Wang, L. (2019). An end to end video text detector with
online tracking. In Proceedings on ICDAR, pp. 601606.
Zamberletti, A., Gallo, I., & Noce, L. (2015). Augmented text character proposals and convolutional
neural networks for text spotting from scene images. In Proceedings on ACPR, pp. 196200.
Zdenek, J. & Nakayama, H. (2020). Erasing scene text with weak supervision. In Proceedings on
WACV.
Zhang, K., Chen, K., & Fan, B. (2021). Massive picture retrieval system based on big data image mining.
Future Generation Computer Systems, 121, 5458.
Zheng, Y., Xie, Y., Zu, Y., Yang, X., Li, C., & Zhang, Y. (2020). Scale robust deep oriented text detection
network. Pattern Recognition, 102, 107180.
Zhou, T., Wang, K., Wu, J., & Li, R. (2019). Video text processing method based on image stitching. In
Proceedings on ICIVC, pp. 561566.
Zhou, Y., Fang, S., Xie, H., Zha, Z., & Zhang, Y. (2019). MLTS: A multi-language scene text spotter. In
Proceedings on ICME, pp. 163168.
Zhu, Y., & Du, J. (2020). TextMountain: Accurate scene text detection via instance segmentation. Pat-
tern Recognition, 110, 107336.
... Video mining has three main tasks: pre-processing, features and semantic information extraction, video patterns and knowledge discovering and forming. Video mining has different applications and usages such as traffic video sequences, medicine, surveillance system and security programs [6], [7]. Video tracking is the process of utilizing a camera to determine the location of an item that changes its position over time. ...
... The first level is background subtraction. It is a fast method to find which pixels have changed Chapter One General Introduction ‫ـــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــ‬ ‫ــــــــــــــــــــ‬ ‫ـــــــــــــــــــــــ‬ 7 in the image (the foreground). The second level is multi-objects tracking methods such as kernelized correlation filters (KCF). ...
Thesis
Full-text available
Computer vision is one of the important scientific fields in the modern era because the video elements contain rich and important information. Hence, knowledge and data van be ontained to refer to a huge amount of useful information. The process of distinguishing and separating only the discovered information is one of the complex and well-known problems. The problem of classification and clustering of moving objects in video data is also a complex task that requires mechanisms, operations, as well as algorithms for the purpose of solving it and obtaining distinct results as possible. In this dissertation, a system is proposed for the purpose of clustering moving objects based on their behavior using a graph mining algorithm. A new algorithm is proposed for the purpose of mining the large data that are represented using a graph. Moreover, another algorithm is proposed for the purpose of data reduction and extracting the important data only. Some of the algorithms used in the proposed system have also been adapted in order to increase their performance. The proposed system firstly splits the video input into sequences frames. The second phase is to apply some preprocessing operations to enhance the quality of frame (still image). The third phase is to apply You Only Lock Once (YOLO) multiple objects detection and Simple Online and Real Time Tracking with a Deep Association Metric (Deep-SORT) tracking objects to discover and track objects with different classes. The fourth phase is to build trajectory for each object and apply a new proposed shape normalization algorithm. The fifth phase is to extract features for trajectories and construct graph for them. The graph data are stored in graph database. The sixth phase is to apply a new iv suggested graph mining algorithm to mine the interested data. Finally, fuzzy c-means is applied to cluster data into a different number of groups. The experimental results suggest that the proposed system is robust with high performance. Algorithms used for detection and tracking outperformed the findings of other detecting and tracking algorithms as they achieved a high accuracy. Moreover, the proposed normalization algorithm shows that about 50% of unrich points are discarded. Furthermore, the graph mining proposed algorithm showed high performance to extract interested data. In addition, the proposed algorithm for graph mining showed a high performance of more than 95% for extracting important data.
... A word cloud is a visual representation of the words associated with text data. It is also used to highlight the words according to their frequency and relevance (Shivakumara et al., 2021). The text analysis follows the specific steps mentioned below. ...
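As a rough illustration of the word-cloud step described above, the following Python sketch counts word frequencies with collections.Counter; the sample text and stopword list are hypothetical, and the commented lines assume the third-party wordcloud package if an actual image is wanted.

from collections import Counter
import re

def word_frequencies(text, stopwords=frozenset({"the", "and", "of", "in", "to", "a"})):
    """Count word occurrences after lowercasing and dropping common stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

abstracts = "Relic tourism and heritage protection and development of relic sites"  # hypothetical corpus
freq = word_frequencies(abstracts)
print(freq.most_common(10))

# Rendering as an image, assuming the third-party `wordcloud` package is installed:
# from wordcloud import WordCloud
# WordCloud(width=800, height=400).generate_from_frequencies(freq).to_file("cloud.png")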
Article
Purpose: The objective of the paper is to find trends of research in relic tourism-related topics. Specifically, this paper uncovers all published studies having latent issues with the keywords “relic tourism” from the Web of Science database. Methods: A total of 109 published articles (2002-2021) related to “relic tourism” were collected. Machine learning tools were applied. Network analysis was used to highlight top researchers in this field, their citations, keyword clusters, and collaborative networks. Text analysis and the Bidirectional Encoder Representations from Transformers (BERT) artificial intelligence model were used to predict text- or keyword-based topic references in machine learning. Results: All the papers are basically published around three primary keywords, namely “relics,” “culture,” and “heritage.” Secondary keywords like “protection” and “development” also attract researchers to this topic. The co-author network is highly significant for diverse authors, and geographically, researchers from five countries are collaborating the most on this topic. Implications: Academically, future research can be predicted with dense keywords. Journals can bring out more special issues related to the topic, as relic tourism still has some unexplored areas. Keywords: Text analysis, machine learning, artificial intelligence, topic modeling, relic tourism.
... In this work, we apply text analytics to generate a word cloud and plot the most frequent terms used in the abstract and title [73]. A word cloud is a visual representation of the words associated with text data, highlighting the words' frequency and relevance [74]. The text analysis follows the specific steps mentioned below. ...
Article
Blockchain and immersive technology are pioneers in bringing digitalization to tourism, and researchers worldwide are exploring many facets of these techniques. This paper analyzes the various aspects of blockchain technology and its potential use in tourism. We explore high-frequency keywords, perform network analysis of relevant publications to analyze patterns, and introduce machine learning techniques to facilitate systematic reviews. We focused on 94 publications from the Web of Science that dealt with blockchain implementation in tourism from 2017 to 2022. We used VOSviewer for network analysis and artificial intelligence models, with the help of machine learning tools, to predict the relevance of the work. Many reviewed articles mainly deal with blockchain in tourism and related terms such as smart tourism and crypto tourism. This study is the first attempt to use text analysis to improve the topic modeling of blockchain in tourism. It comprehensively analyzes the technology’s potential use in the hospitality, accommodation, and booking industry. In this context, the paper provides significant value to researchers by giving an insight into trends and keyword patterns. Tourism still has many unexplored areas; journal articles should also feature special studies on this topic.
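A keyword co-occurrence count is the core ingredient of the kind of keyword network analysis mentioned above; the sketch below builds such counts from hypothetical per-paper keyword lists (the papers and keywords are made up, and this simple counting differs from the VOSviewer tooling used in the study).

from collections import Counter
from itertools import combinations

# Hypothetical author-keyword lists for a handful of publications.
papers = [
    ["blockchain", "tourism", "smart tourism"],
    ["blockchain", "crypto tourism", "booking"],
    ["tourism", "smart tourism", "booking"],
]

cooccurrence = Counter()
for keywords in papers:
    for a, b in combinations(sorted(set(keywords)), 2):
        cooccurrence[(a, b)] += 1     # each pair appearing in the same paper is one link

# Edges with the highest counts form the densest part of the keyword network.
print(cooccurrence.most_common(5))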
... These classifiers are later used to facilitate multiple forms of principal textual analysis and mining problems, such as text retrieval, topic identification/modelling, sentiment/opinion analysis, fake news/spam detection, etc. In recent times, the rapid development of the Internet has led to the dramatic emergence of huge textual resources from the Internet's large-scale online platforms, such as social networks (Facebook, Twitter, etc.) (Zucco et al. 2020), social media platforms (YouTube, Instagram, etc.) (Shivakumara et al. 2021), and open encyclopedias/knowledge graphs (Wikipedia, YAGO, etc.). In fact, there is a greater need for developing more powerful tools/systems which can effectively support extracting valuable knowledge from these huge textual resources. ...
Article
Recently, with the rapid development of the Internet and social networks, there has been a tremendous increase in the amount of complex-structured text resources. These information explosions require extensive studies as well as more advanced methods in order to better understand and effectively model/learn these high-dimensional, structurally complicated textual datasets. Moving along with the recent progress in deep learning and textual representation learning approaches, many researchers in this domain have been attracted to utilizing different deep neural architectures for learning essential features from texts. These novel neural architectures must be able to handle complex textual feature engineering. Moreover, they also have to be able to extract deeper semantic and structural information from textual resources. Recently, several integrations of advanced deep learning architectures, such as recurrent neural networks (RNNs), sequence-to-sequence (seq2seq) models and transformers, have been proposed for text classification. These hybrid deep neural architectures have shed light on how computers can comprehensively process sequential information from texts and be fine-tuned to leverage the performance of multiple tasks in natural language processing, including classification. However, most recent RNN-based techniques still suffer from several limitations. These limitations are mainly related to the capability of capturing the global long-range dependencies as well as the syntactical structures of the given text corpus. Some recent studies have shown that a combination of graph-based text representation and graph neural network (GNN) approaches can cope with these challenges. In this survey, we mainly focus on discussing recent state-of-the-art studies dedicated to text graph representation learning through GNNs, referred to as TG-GNN. In addition, besides discussing the features and capabilities of TG-GNN-based models, we also mention their pros and cons. Extensive comparative studies of TG-GNN-based techniques on benchmark datasets for the text classification problem are also provided in this survey. Finally, we highlight existing challenges as well as identify perspectives which might be useful for future improvements in this research direction.
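For readers unfamiliar with GNN-based text models, a single graph-convolution layer of the form H' = ReLU(D^-1/2 (A + I) D^-1/2 H W) can be sketched in a few lines of numpy; the toy adjacency matrix, features and weights below are hypothetical and only illustrate the propagation rule, not any specific TG-GNN model.

import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])                       # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)

# Toy text graph: 4 word/document nodes with 8-dimensional input features.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
H = np.random.randn(4, 8)
W = np.random.randn(8, 3)                                # 3 output classes/features
print(gcn_layer(A, H, W).shape)                          # (4, 3)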
Chapter
Computer vision aims to build autonomous systems that can perform some of the human visual system’s tasks (and even surpass it in many cases). Among the several applications of computer vision, extracting information from natural scene images is prominent and influential. The information gained from an image can range from identification and space measurements for navigation to augmented reality applications. These scene images contain relevant text elements as well as many non-text elements. Prior to extracting meaningful information from the text, the foremost task is to classify the text and non-text elements correctly in the given images. The present paper aims to build machine learning models for accurately classifying the text and non-text elements in the benchmark ICDAR 2013 dataset. The result is obtained in terms of the confusion matrix to determine the overall accuracy of the different machine learning models. Keywords: Natural scene images, machine learning models, text and non-text components, classifiers.
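A minimal sketch of this kind of text/non-text classification experiment, using scikit-learn's RandomForestClassifier and confusion_matrix on synthetic stand-in features; in practice, the features would be extracted from connected components of ICDAR 2013 images rather than generated randomly as here.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in features: each row describes one connected component
# (e.g., simple shape/contrast statistics); label 1 = text, 0 = non-text.
rng = np.random.default_rng(0)
X = rng.random((500, 6))
y = (X[:, 0] + 0.5 * X[:, 3] > 0.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)

print("confusion matrix:\n", confusion_matrix(y_te, pred))
print("overall accuracy:", accuracy_score(y_te, pred))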
Article
Maintenance has largely remained a human-knowledge centered activity, with the primary records of activity being text-based maintenance work orders (MWOs). However, the bulk of maintenance research does not currently attempt to quantify human knowledge, though this knowledge can be rich with useful contextual and system-level information. The underlying quality of data in MWOs often suffers from misspellings, domain-specific (or even workforce-specific) jargon, and abbreviations that prevent its immediate use in computer analyses. Therefore, approaches to making this data computable must translate unstructured text into a formal schema or system; i.e., perform a mapping from informal technical language to some computable format. Keyword spotting (or extraction) has proven a valuable tool in reducing manual efforts while structuring data, by providing a systematic methodology to create computable knowledge. This technique searches for known vocabulary in a corpus and maps it to designed higher-level concepts, shifting the primary effort away from structuring the MWOs themselves toward creating a dictionary of domain-specific terms and the knowledge that they represent. The presented work compares rules-based keyword extraction to data-driven tagging assistance through quantitative and qualitative discussion of the key advantages and disadvantages. This will enable maintenance practitioners to select an appropriate approach to information encoding that provides the needed functionality at minimal cost and effort.
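A rules-based keyword-spotting step of the kind compared in this work can be sketched as a simple dictionary lookup; the vocabulary-to-concept mapping and the sample work order below are hypothetical.

import re

# Hypothetical domain dictionary mapping surface forms (including jargon and
# abbreviations) to higher-level maintenance concepts.
CONCEPTS = {
    "hyd": "hydraulic_system", "hydraulic": "hydraulic_system",
    "leak": "fluid_leak", "leaking": "fluid_leak",
    "brg": "bearing", "bearing": "bearing",
}

def tag_work_order(text):
    """Rules-based keyword spotting: map known vocabulary to concepts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sorted({CONCEPTS[t] for t in tokens if t in CONCEPTS})

print(tag_work_order("Hyd pump leaking near front brg"))
# ['bearing', 'fluid_leak', 'hydraulic_system']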
Article
License plate number detection in drone images is a complex problem because the images are generally captured at oblique angles and pose several challenges like perspective distortion, non-uniform illumination effects, degradations, blur, occlusion, loss of visibility, etc. Unlike most existing methods, which focus on images captured from an orthogonal (head-on) direction, the proposed work focuses on drone text images. Inspired by the Phase Congruency Model (PCM), which is invariant to non-uniform illumination, contrast variations, geometric transformations and, to some extent, distortion, we explore the combination of DCT and PCM (DCT-PCM) for detecting license plate number text in drone images. Motivated by the strong discriminative power of deep learning models, the proposed method exploits fully connected neural networks for eliminating false positives to achieve better detection results. Furthermore, the proposed work constructs a working model that fits a real environment. To evaluate the proposed method, we use our own dataset captured by drones and a benchmark license plate dataset, namely Medialab, for experimentation. We also demonstrate the effectiveness of the proposed method on benchmark natural scene text detection datasets, namely SVT, MSRA-TD-500, ICDAR 2017 MLT and Total-Text.
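As a loose illustration of how DCT responses can flag text-like regions (this is not the DCT-PCM method itself), the sketch below computes the share of non-DC DCT energy in an image block with scipy.fftpack; stroke-heavy blocks tend to score higher than smooth background. The block contents are hypothetical.

import numpy as np
from scipy.fftpack import dct

def dct2(block):
    """2-D type-II DCT of an image block."""
    return dct(dct(block, axis=0, norm="ortho"), axis=1, norm="ortho")

def high_frequency_ratio(block):
    """Share of DCT energy outside the DC coefficient; text-like blocks with
    strong strokes/edges tend to score higher than smooth background blocks."""
    coeffs = dct2(block.astype(np.float64))
    total = np.sum(coeffs ** 2) + 1e-12
    return 1.0 - (coeffs[0, 0] ** 2) / total

smooth = np.full((16, 16), 120.0)                        # flat background block
stripes = np.tile([0.0, 255.0], (16, 8))                 # stroke-like pattern
print(high_frequency_ratio(smooth), high_frequency_ratio(stripes))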
Article
At present, multi-oriented text detection methods based on deep neural networks have achieved promising performance on various benchmarks. Nevertheless, there are still some difficulties in arbitrary shape text detection, especially for a simple and proper representation of arbitrary shape text instances. In this paper, a pixel-based text detector is proposed to facilitate the representation and prediction of text instances with arbitrary shapes in a simple manner. Firstly, to alleviate the influence of the target vertex sorting and achieve the direct regression of arbitrary shape text instances, a starting-point-independent coordinate regression loss is proposed. Furthermore, to predict more accurate text instances, a text instance accuracy loss is proposed as an auxiliary task to refine the predicted coordinates under the guidance of IoU. To evaluate the effectiveness of our detector, extensive experiments have been carried out on public benchmarks which contain arbitrary shape text instances and multi-oriented text instances. We obtain an F-measure of 84.8% on the Total-Text benchmark. The results show that our method can reach state-of-the-art performance.
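One way to read a starting-point-independent regression loss is as a distance minimized over all cyclic shifts of the target vertex order; the numpy sketch below implements that interpretation. It is an illustrative simplification, not the paper's exact loss, and the polygons are hypothetical.

import numpy as np

def start_free_l1(pred, target):
    """L1 distance between two polygons (N x 2 vertex arrays) that ignores
    which vertex is labelled as the starting point: take the minimum over all
    cyclic shifts of the target vertex order."""
    n = target.shape[0]
    losses = [np.abs(pred - np.roll(target, k, axis=0)).mean() for k in range(n)]
    return min(losses)

# Toy example: the same quadrilateral annotated with two different start vertices.
target = np.array([[0, 0], [10, 0], [10, 5], [0, 5]], dtype=float)
pred = np.roll(target, 1, axis=0) + 0.1                  # shifted start, tiny offset
print(start_free_l1(pred, target))                       # ~0.1, not a large mismatch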
Article
The traditional picture retrieval system has a slow retrieval speed, poor retrieval accuracy, and a low recall when performing massive picture retrieval. In this paper, we design a massive picture retrieval system using big data image mining technology. It is constructed with a data processing layer, a business logic layer and a presentation layer, and works through the three steps of data segmentation, mining and merging. For instance, it runs the distributed file system module in a Master/Slave operation mode and designs file read and write requests according to user interaction. Next, it performs parallel computing over picture data sets based on the MapReduce module to solve the picture matching and similarity metrics and returns the sorted picture matching results to the user. Then, it extracts the color and texture features of the target area to generate the final picture retrieval result. We select a large number of pictures on a big data platform as the simulation test set. The results show that the system we designed has good retrieval accuracy and a high retrieval speed, which greatly improves the recall of picture retrieval.
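The colour-feature matching step of such a retrieval pipeline can be sketched with a joint RGB histogram and histogram-intersection similarity; the images below are random stand-ins, and the MapReduce/distributed parts are omitted.

import numpy as np

def color_histogram(image, bins=8):
    """Normalised joint RGB histogram used as a simple colour feature."""
    hist, _ = np.histogramdd(image.reshape(-1, 3), bins=(bins,) * 3,
                             range=[(0, 256)] * 3)
    return (hist / hist.sum()).ravel()

def retrieve(query, database, top_k=3):
    """Rank database images by histogram-intersection similarity to the query."""
    q = color_histogram(query)
    scores = [np.minimum(q, color_histogram(img)).sum() for img in database]
    return np.argsort(scores)[::-1][:top_k]

rng = np.random.default_rng(0)
database = [rng.integers(0, 256, (64, 64, 3)) for _ in range(5)]
print(retrieve(database[2], database))                   # index 2 should rank first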
Article
The detection of text in an image and identification of its language are important tasks in optical character recognition. Such tasks are challenging, particularly in natural scene images. Previous studies have been conducted with a focus on convolutional neural networks for script identification. In other studies, fully convolutional networks (FCNs) have been used for model enhancement and not as classifiers. In this study, we use FCNs for both model enhancement and classification. The proposed methodology improves the Efficient and Accurate Scene Text Detector by adding new FCN branches for script identification. Moreover, whereas most end-to-end (e2e) methods train the text detection and script identification models separately, we propose two e2e methods for jointly training the models, namely, multi-channel mask (MCM) and multi-channel segmentation (MCS). The results show that the performance of an MCM is similar to that of other state-of-the-art methods, whereas MCS outperforms existing methods with recall values of 54.34% and 81.13%, when using the ICDAR MLT 2017 and MLe2e datasets, respectively.
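A script-identification branch of an FCN amounts to a per-pixel classifier over a shared feature map; the numpy sketch below mimics a 1x1-convolution branch with a softmax over hypothetical script classes (shapes and weights are made up, and this is not the proposed MCM/MCS architecture).

import numpy as np

def script_branch(feature_map, W, b):
    """A 1x1-convolution style classification branch: every spatial position of
    the shared feature map gets a softmax over script classes."""
    logits = feature_map @ W + b                         # (H, W, n_scripts)
    logits -= logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs.argmax(axis=-1), probs

# Toy shapes: a 32x32 feature map with 64 channels and 4 candidate scripts.
feat = np.random.randn(32, 32, 64)
W = np.random.randn(64, 4) * 0.01
b = np.zeros(4)
script_map, probs = script_branch(feat, W, b)
print(script_map.shape, probs.shape)                     # (32, 32) (32, 32, 4)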
Article
In this paper, an end-to-end multi-task deep neural network was proposed for simultaneous script identification and Keyword Spotting (KWS) in multi-lingual handwritten and printed document images. We introduced a unified approach which addresses both challenges cohesively by designing a novel CNN-BLSTM architecture. The script identification stage involves local and global feature extraction to allow the network to cover more relevant information. Contrary to traditional feature fusion approaches, which build a linear feature concatenation, we employed a compact bilinear pooling to capture pairwise correlations between these features. The script identification result is then injected into the KWS module to eliminate characters of irrelevant scripts and perform the decoding stage in a single-script mode. All the network parameters were trained in an end-to-end fashion using multi-task learning that jointly minimizes the NLL loss for script identification and the CTC loss for the KWS. Our approach was evaluated on a variety of public datasets of different languages and writing types. Experiments proved the efficacy of our deep multi-task representation learning compared to the state-of-the-art systems for both keyword spotting and script identification tasks.
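Compact bilinear pooling is commonly realized with the Tensor Sketch trick: count-sketch both feature vectors and combine them by circular convolution in the Fourier domain. The numpy sketch below follows that generic recipe on hypothetical local/global feature vectors; it is an illustration of the technique, not the authors' exact module.

import numpy as np

def count_sketch(v, h, s, d):
    """Project vector v to d dimensions with random hash h and signs s."""
    out = np.zeros(d)
    np.add.at(out, h, s * v)
    return out

def compact_bilinear(x, y, d=512, seed=0):
    """Tensor-sketch approximation of the outer product of x and y: sketch both
    vectors, then do circular convolution in the Fourier domain."""
    rng = np.random.default_rng(seed)
    hx, hy = rng.integers(0, d, x.size), rng.integers(0, d, y.size)
    sx, sy = rng.choice([-1.0, 1.0], x.size), rng.choice([-1.0, 1.0], y.size)
    fx = np.fft.rfft(count_sketch(x, hx, sx, d))
    fy = np.fft.rfft(count_sketch(y, hy, sy, d))
    return np.fft.irfft(fx * fy, n=d)

# Fuse hypothetical local and global script features into one compact vector.
local_feat, global_feat = np.random.randn(256), np.random.randn(256)
print(compact_bilinear(local_feat, global_feat).shape)   # (512,)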
Article
In this work, we address the task of scene text retrieval: given a text query, the system returns all images containing the queried text. The proposed model uses a single shot CNN architecture that predicts bounding boxes and builds a compact representation of spotted words. In this way, this problem can be modeled as a nearest neighbor search of the textual representation of a query over the outputs of the CNN collected from the totality of an image database. Our experiments demonstrate that the proposed model outperforms previous state-of-the-art, while offering a significant increase in processing speed and unmatched expressiveness with samples never seen at training time. Several experiments to assess the generalization capability of the model are conducted in a multilingual dataset, as well as an application of real-time text spotting in videos.
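The retrieval step described here reduces to a nearest-neighbour search between a query embedding and the word representations collected from the image database; a minimal cosine-similarity sketch, with hypothetical 128-dimensional embeddings and image ids, is given below.

import numpy as np

def cosine_retrieval(query_vec, db_vecs, db_meta, top_k=5):
    """Return the images whose spotted-word embeddings are closest to the query."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    D = db_vecs / (np.linalg.norm(db_vecs, axis=1, keepdims=True) + 1e-12)
    scores = D @ q
    order = np.argsort(scores)[::-1][:top_k]
    return [(db_meta[i], float(scores[i])) for i in order]

# Hypothetical database: one embedding per spotted word, tagged with its image id.
rng = np.random.default_rng(0)
db_vecs = rng.standard_normal((1000, 128))
db_meta = [f"image_{i % 200}.jpg" for i in range(1000)]
query = db_vecs[42] + 0.05 * rng.standard_normal(128)    # noisy copy of entry 42
print(cosine_retrieval(query, db_vecs, db_meta, top_k=3))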
Article
We introduce a new arbitrary-shaped text detection approach named ReLaText by formulating text detection as a visual relationship detection problem. To demonstrate the effectiveness of this new formulation, we start by using a “link” relationship to address the challenging text-line grouping problem. The key idea is to decompose text detection into two subproblems, namely detection of text primitives and prediction of link relationships between nearby text primitive pairs. Specifically, an anchor-free region proposal network based text detector is first used to detect text primitives of different scales from different feature maps of a feature pyramid network, from which a text primitive graph is constructed by linking each pair of nearby text primitives detected from the same feature map with an edge. Then, a Graph Convolutional Network (GCN) based link relationship prediction module is used to prune wrongly-linked edges in the text primitive graph to generate a number of disjoint subgraphs, each representing a detected text instance. As the GCN can effectively leverage context information to improve link prediction accuracy, our GCN based text-line grouping approach can achieve better text detection accuracy than previous text-line grouping methods, especially when dealing with text instances with large inter-character or very small inter-line spacing. Consequently, the proposed ReLaText achieves state-of-the-art performance on five public text detection benchmarks, namely RCTW-17, MSRA-TD500, Total-Text, CTW1500 and DAST1500.
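The edge-pruning and grouping stage can be sketched independently of the GCN: keep only links whose predicted score passes a threshold and take the connected components of the remaining graph as text instances. The union-find sketch below uses hypothetical primitive indices and link scores.

# Hypothetical link scores between text-primitive pairs (e.g., from a GCN).
edges = {(0, 1): 0.95, (1, 2): 0.90, (2, 3): 0.10, (3, 4): 0.88}
n_primitives, threshold = 5, 0.5

parent = list(range(n_primitives))          # union-find over primitives

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]       # path compression
        i = parent[i]
    return i

for (a, b), score in edges.items():
    if score >= threshold:                  # keep only confidently linked pairs
        parent[find(a)] = find(b)

groups = {}
for i in range(n_primitives):
    groups.setdefault(find(i), []).append(i)
print(list(groups.values()))                # [[0, 1, 2], [3, 4]]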
Article
An intelligent transportation system facilitates smart services and applications that can revolutionize the traffic and travel experience. A driver assistance system is a crucial part of such a system that helps to improve the safety and security of passengers by mitigating on-road collisions and potential hazards. The precise sensing (localization) and spotting of scene texts and traffic signs are important for achieving higher performance in real time. It is, however, affected by motion blur and camera shake noise, which makes the process of spotting complex. In this paper, we propose a robust text spotter, denoted Blurred TextSpotter, for efficient and cost-effective spotting in blurry scene images. We address different noises, like motion blur, Gaussian blur, camera shake noise, and interclass interference. We apply a multi-scale contextual-information-enriched encoder-decoder based backbone network followed by spatial and channel-wise attention. We predict text masks and accurately classify words using a hardware-efficient recognition module. The experimental results on five publicly available benchmark datasets show the efficiency of the proposed text spotter in terms of detection, recognition, and spotting of curved text instances in scene images.
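Channel-wise attention is commonly implemented in squeeze-and-excitation style: pool each channel globally, pass the result through a small bottleneck, and rescale the channels. The numpy sketch below illustrates that general pattern with hypothetical shapes; it is not the paper's exact attention module.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, W1, W2):
    """Squeeze-and-excitation style channel attention: global-average-pool each
    channel, pass through a small bottleneck MLP, and rescale the channels."""
    squeeze = feat.mean(axis=(0, 1))                     # (C,)
    excite = sigmoid(np.maximum(squeeze @ W1, 0.0) @ W2) # (C,) weights in (0, 1)
    return feat * excite[None, None, :]

# Toy feature map with 32 channels and a reduction ratio of 4 in the bottleneck.
feat = np.random.randn(20, 20, 32)
W1 = np.random.randn(32, 8) * 0.1
W2 = np.random.randn(8, 32) * 0.1
print(channel_attention(feat, W1, W2).shape)             # (20, 20, 32)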