Mining Text from Natural Scene and Video Images - A Survey
Palaiahnakote Shivakumara1, Alireza Alaei2, Umapada Pal3, *
1Faculty of Computer Science and Information Technology, University of Malaya, Malaysia.
Email: shiva@um.edu.my
2Faculty of Science and Engineering, Southern Cross University, Australia
Email: ali.alaei@scu.edu.au
3Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata, India
Email: umapada@isical.ac.in
How to cite this article: Shivakumara, P., Alaei, A., & Pal, U. (2021). Mining text from natural scene and
video images: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, e1428.
https://doi.org/10.1002/widm.1428
Abstract
In computer terminology, mining is considered as extracting meaningful information or knowledge from
a large amount of data/information using computers. The meaningful information can be extracted from
normal text, and images obtained from different resources, such as natural scene images, video and
documents by deriving semantics from text and content of the images. Although there are many pieces
of work on text/data mining and several survey/review papers are published in the literature, to the best
of our knowledge there is no survey paper on mining textual information from the natural scene, video
and document images considering word spotting techniques. In this paper, we, therefore, provide a
comprehensive review of both the non-spotting and spotting based mining techniques. The mining ap-
proaches are categorized as feature, learning and hybrid-based methods to analyze the strengths and
limitations of the models of each category. In addition, it also discusses the usefulness of the methods
according to different situations and applications. Furthermore, based on the review of different mining
approaches, this paper identifies the limitations of the existing methods and suggests new applications
and future directions to continue the research in multiple dimensions. We believe such a review article
will be useful to the researchers to quickly get the state-of-the-art information and the progress made
towards mining textual information from natural scene and video images.
Keywords: Document images, Keyword spotting, Natural scene images, Video images, Text mining.
1 INTRODUCTION
Text mining involves automatically discovering/extracting new and/or previously unknown vital and quality
information using machines. It is a multidisciplinary field of research, which incorporates and integrates
different tools and concepts from information retrieval, data mining, machine learning, statistics, and
computational linguistics. Text mining has several applications in many areas, including risk and
knowledge management, cybercrime prevention, content enrichment, and fraud detection. In addition,
it can also assist in mining important information from a large database, which contains heterogeneous
and diverse data, such as document, natural scene and video images. Document, natural scene and
video images can be processed to extract their content, layout and logical structures. This process can
further help to extract knowledge from images/videos at different levels of granularity, such as page,
text-line, word, and character. Automated extraction of knowledge from document images can also im-
prove document image analysis applications in different contexts, including document classification and
indexing, document reformatting, and document reconstruction.
Text detection and recognition, especially in the scene and video images, are active research topics in
the domain of document image analysis, in particular, and data/text mining, in general (Shi et al., 2014).
As the text in a natural scene or video image is the main source of semantic information and provides
rich information about the content of the image, there are several real-world text mining applications,
such as contextual advertising, business intelligence, and content enrichment. It has also been shown
in the literature that foreground information, including text and salient objects, draw the attention of
viewers (Judd et al., 2009; Alaei et al., 2015; Alaei et al., 2017). Moreover, text detection followed by
recognition is an essential part of several computer vision applications, such as automatic sign reading,
language translation, autonomous car driving, and multimedia retrieval. As an example, an intelligent
transportation system can significantly transform the traffic and travel experience of people. Driver as-
sistance systems and autonomous cars are crucial parts of such a system and improve the safety and
security of passengers (Bagi et al., 2020).
There are several methods for text mining from images in large datasets in the literature (Zhang
et al. 2021; Lee & Wang, 2012; Jung & Lee, 2020). However, a review of the literature revealed that
most of the methods focus on annotations extracted by content-based image retrieval approaches.
From the literature, it is also noted that these methods are not robust for mining images from a large
diverse dataset because of the gap between low-level and high-level features, which do not match with
the actual meaning of the content of the image. This is the main gap between the current methods and
mining applications related to the text in natural scene, video and document images. This has motivated
many researchers to propose text-based approaches for mining images from datasets. As a result,
various methods called spotting techniques are proposed based on text in the images, videos and doc-
uments to extract the exact meaning of the text content instead of annotations derived from low-level
features to bridge this gap. However, most of the research work in the literature aims at extracting
particular information from a specific domain, such as extracting information from only images, only
videos or only documents but not all three together. Due to the popularity of social media, advanced
internet technologies and variations in digitization technologies for capturing data, one can expect di-
verse and heterogeneous datasets, which may include text, images, videos and documents in different
formats. To understand the advances in these domains, we have gathered a list of the most relevant
and recent papers and written a survey on mining text from natural scene, video and document images
under a single platform. This can further allow other researchers to develop new solutions to the chal-
lenges and problems of this research domain.
The rest of the paper is organized as follows. Section 2 highlights the needs and motivation for text
mining in natural scene and video images. Section 3 provides a brief idea of the way the papers are
collected and the survey is conducted. Section 4 deals with non-spotting based mining approaches
where feature based, learning based, and the combination of both are discussed for both natural scene
and video images. Section 5 deals with spotting based mining approaches. Future directions of research
are further discussed in Section 6. Finally, conclusions are drawn in Section 7.
2 MOTIVATION FOR TEXT MINING IN NATURAL SCENE AND VIDEO IMAGES
There are several methods developed for text detection, recognition and keyword spotting from the
document, natural scene and video images. Keyword spotting, as one of the document image analysis
techniques, includes a systematic methodology and framework to facilitate this transformation and help
to create computable knowledge. This technique searches for a known vocabulary in a document image
dataset and maps them to higher-level concepts created by indexing the document images and creating
a dictionary of domain-specific terms and the knowledge they represent (Sexton et al., 2018). As an
important category of approaches for knowledge discovery and mining, and with the increase in gener-
ating textual information in the forms of the scene and video images, word spotting from the scene and
video images has grown in recent years. For the past decade, several researchers from both commu-
nities of computer vision and document analysis have developed powerful methods for scene text de-
tection and recognition. Considering the huge number of methods for text spotting, one can expect to
see different approaches, evaluation schemes, and experiments on different datasets for solving the
same problem. Moreover, due to the use of multiple datasets, evaluation schemes, measures and ap-
proaches for text spotting in the images, it is difficult to analyze the scope, limitation and significance of
the methods. This leads to confusion for the reader and viewer to choose the relevant methods, define
a new challenge and find suitable applications. In addition, scene and video text spotting as textual
mining from the images has been ignored by the community compared with the methods that only
concern either scene text detection or text recognition, separately.
Thus, there is a need for a survey on text mining from natural scene and video images to understand
the growth, the objective, scope, limitation, publicly available datasets for experimentation and compar-
ative study. The survey paper can also provide readers with a clear idea of what has been done in the
past and further show them clear directions and new applications for future researchers. It is worth
noting that there are good survey papers, for example, Dadiya et al., 2019; Pooja et al., 2016; Sharma
et al., 2012; Ye et al., 2015 and Yin et al., 2016, which include old models. Several methods have been
proposed in 2019, 2020 and 2021 for addressing different issues of text spotting but there is no survey
paper to provide a summary of the recent research papers (Chekhrouhou et al., 2021; Mokayed et al.,
2021; Li et al., 2021; Khalil et al., 2021). Therefore, this survey mainly focuses on the recently published
methods/models to provide a quick summary of the spotting techniques in this particular domain. More-
over, this survey considers the multilingual models as normal text spotting approaches for reviewing.
As deep learning end-to-end models for text spotting in natural scene and video images are proposed
to avoid preprocessing steps, such as noise removal, deblurring, text alignment issues and the effect
of perspective distortion, reviewing preprocessing methods for text spotting is considered out of the
scope of this survey. In summary, this survey on the mining of text from natural scene and video images
can provide state-of-the-art information to the researchers interested in the context of text mining. It can
further help new researchers to find new challenges, and applications and to investigate new ideas by
referring to existing ideas.
3 BRIEF METHODOLOGY
To identify text detection studies in relation to natural scene and video images in the literature, the
advanced Google search engine rather than Google Scholar, Scopus or Web of Science has been
considered in this study. The choice of Google search helped us to retrieve papers that might not be
indexed in Google Scholar and other repositories. A list of keywords, including “text spotting in natural
scene image”, “text extraction in natural scene image”, “text spotting in video”, “text extraction in video”,
“text spotting in document image”, “text extraction in document image”, “word spotting in document
image”, “text spotting”, “text detection”, “word spotting” and “text extractionwere used to broadly search
and retrieve relevant and recent articles published on different outlets. Moreover, recent review and
survey papers on text and word spotting have been studied to extract relevant text spotting references.
This process resulted in 175 papers in the domain. We then reviewed the title, keywords and abstract
of each paper and excluded papers from the list if they did not contain any form of the words, including
video, scene, image, word, spotting, detection and extraction in their titles, abstracts and the list of their
keywords. As a result, we considered and reviewed a critical mass (95 papers) from the literature of
text spotting in images and videos. It is worth noting that a few (7) other papers have also been included
to enrich the literature review.
To provide a better understanding of the survey, we initially categorized the methods into non-spotting
and spotting approaches. We further classified the methods in each category into scene- and video-
based methods. Concerning the types of methods presented in each category in the literature, feature-,
learning- and hybrid-based methods have been considered as three distinct types of methods.
4 NON-SPOTTING BASED MINING APPROACHES
It is noted that the main focus of the data mining approaches is to extract meaningful information from
large and diverse databases. In the same way, when we consider images and video data collected from
social media, the size and variation of the data can be huge. In this context, extracting meaningful
information from such huge data is not easy for content-based image retrieval methods. This is due to
the gap between the content of the images and the extracted low-level features, which cannot appropriately
represent the images. To overcome this limitation of content-based image retrieval methods, text based
methods are proposed to provide the exact meaning and relevant information of the content of the images,
where they contain textual information.
These methods can broadly be classified into scene and video text based methods. The scene text
based methods focus on extracting text from natural scene images without temporal information, while
video text based methods focus on text extraction from videos by exploring temporal information. Meth-
ods in the literature of each group can further be categorized based on different perspectives, including
i) textual content (richness and sparseness of the image content), ii) type of documents (printed and
handwritten), and iii) methodological approaches (feature, learning and the combination of both
approaches). In this research work, methods in each category are categorized into three sub-categories,
namely, feature, learning and combination based methods. The schematic diagram of the methods for
extracting text from the images and video can be seen in Fig. 1, where different categories of the meth-
ods for mining information from images and video are represented. The feature-based methods give
more importance to feature extraction for finding a solution to text detection challenges and use con-
ventional classifiers for text extraction from input images or video. These methods may not be accurate
and robust to complex images. To alleviate this problem, learning-based methods are proposed. These
approaches give importance to learning with the ground truth and predefined labels to address the
challenges of text extraction. However, these methods work well only for known datasets and their
performance depends on pre-defined samples. Hence, the methods fall short in terms of generality. This
motivated researchers to introduce methods that combine both feature and learning based methods,
resulting in feature + Convolutional Neural Network (CNN) based methods. These methods can provide
robustness, generality and high accuracy even in complex situations, as they integrate the advantages
of features and deep learning models.
4.1 Methods for Non-Spotting in Natural Scene Images
As mentioned, text mining methods based on non-spotting approaches in natural scene images can be
divided into three sub-categories of feature-based, learning-based and combination of feature and
learning based approaches, as shown in Fig. 2. The feature based methods target extracting unique
properties of text to differentiate text pixels from non-text pixels in an image. The features are extracted
based on the regular pattern of text information, such as the shape of characters, the color of text pixels,
the spacing between characters, the size of the characters, and the orientation of the characters. The
first step of the feature-based method is to remove non-text information from the images. In general,
the methods exploit the above-mentioned properties to retain text pixels and remove non-text pixels
resulting in a set of text candidates. The text candidates are then used to restore full text information based
on the nearest neighbor criterion, the spatial relationship between the text candidates and the orienta-
tion of text candidates. Bounding boxes are finally fixed for the words or text lines of any orientation by
exploring the concept of polygonal approximation and curve fitting.
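As a concrete illustration of this conventional pipeline, the following sketch (a minimal example of our own, not the implementation of any surveyed method; the input file name and the thresholds are assumptions) detects text candidates with OpenCV's MSER detector, groups nearby candidates with a simple nearest-neighbour criterion, and fits one bounding box per group.

```python
import cv2
import numpy as np

def detect_text_candidates(gray):
    """Detect text candidate regions with MSER (a common choice in
    feature-based methods); returns a list of (x, y, w, h) boxes."""
    mser = cv2.MSER_create()
    _, boxes = mser.detectRegions(gray)
    keep = []
    for (x, y, w, h) in boxes:
        # Simple shape heuristics mimicking the character-property rules above.
        aspect = w / float(h)
        if 4 <= h <= 200 and 0.1 <= aspect <= 10:
            keep.append((int(x), int(y), int(w), int(h)))
    return keep

def group_candidates(boxes, gap_ratio=1.5):
    """Greedily merge candidates whose horizontal gap is small relative to
    their height (a crude nearest-neighbour grouping criterion)."""
    boxes = sorted(boxes, key=lambda b: b[0])
    groups = []
    for b in boxes:
        x, y, w, h = b
        placed = False
        for g in groups:
            gx, gy, gw, gh = g[-1]
            same_line = abs((gy + gh / 2) - (y + h / 2)) < 0.6 * max(gh, h)
            close = x - (gx + gw) < gap_ratio * max(gh, h)
            if same_line and close:
                g.append(b)
                placed = True
                break
        if not placed:
            groups.append([b])
    # Fit one axis-aligned bounding box per group (word/line hypothesis).
    merged = []
    for g in groups:
        xs = [b[0] for b in g]; ys = [b[1] for b in g]
        xe = [b[0] + b[2] for b in g]; ye = [b[1] + b[3] for b in g]
        merged.append((min(xs), min(ys), max(xe) - min(xs), max(ye) - min(ys)))
    return merged

if __name__ == "__main__":
    img = cv2.imread("scene.jpg")            # hypothetical input image
    if img is not None:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in group_candidates(detect_text_candidates(gray)):
            cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.imwrite("detections.jpg", img)
```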
The deep learning models for non-spotting in natural scene and video images can be further classified into
regression/anchor based models (Arafat et al., 2020; Chandio et al., 2020; Huang et al., 2019) and
segmentation based models (Dai et al., 2020; Guo et al., 2020; Cai et al., 2020). The former class considers
the whole text as an object for detection, while the latter class merges pixel by pixel or character by
character for text detection in natural scene and video images. When the models consider the whole
text as an object, there are chances of misclassifying non-text as text due to text-like objects in the
background, and therefore the performance of the models degrades. To overcome this problem,
segmentation based models work at the pixel level; these methods are, however, sensitive to complex
backgrounds compared to regression-based models.
Fig. 1. Mining tree of non-spotting based methods, with Scene Images and Video branches each divided
into F, L and F + L, where F, L and F + L denote Feature, Learning and the combination of Feature and
Learning based methods, respectively.
Learning based methods use ground truth for training models. Most of the models consider pixels or/and
different forms of input images as input for designing the architecture of Neural Network (NN) based
models. As the number of layers in the NNs increases, the ability of the architecture also increases.
Therefore, the models can work on complex situations with high accuracy compared to feature based
methods. At present, the models use different architectures, such as ResNet, U-Net, and GAT, to com-
bine information and be more generic. Like feature-based methods, the outputs of the learning-based
models are text regions. The text regions are then segmented using a simple thresholding criterion. For
fixing bounding boxes, the models use curve fitting concepts and orientation of the text. In the case of
text detection, non-text does not have a boundary and hence getting relevant samples that represent
all possible cases of non-text regions is difficult. As these models are designed based on pre-defined
samples, there are, however, high chances of losing accuracy for totally unknown/unseen input images.
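To make this post-processing step concrete, the sketch below (our own minimal illustration, assuming a segmentation-style network has already produced a per-pixel text score map) applies the simple thresholding criterion and fits rotated bounding boxes with OpenCV.

```python
import cv2
import numpy as np

def boxes_from_score_map(score_map, thresh=0.5, min_area=30):
    """Turn a per-pixel text score map (H x W, values in [0, 1]) into
    rotated word/line bounding boxes via thresholding and contour fitting."""
    # 1. Simple thresholding of the network output.
    binary = (score_map > thresh).astype(np.uint8) * 255
    # 2. Each connected blob is treated as one text region.
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for cnt in contours:
        if cv2.contourArea(cnt) < min_area:
            continue  # discard tiny blobs, likely noise/false positives
        rect = cv2.minAreaRect(cnt)          # (center, size, angle)
        boxes.append(cv2.boxPoints(rect))    # 4 corner points, any orientation
    return boxes

# Usage with a dummy score map (in practice this comes from the model):
score = np.zeros((100, 200), dtype=np.float32)
score[40:60, 30:170] = 0.9
print(len(boxes_from_score_map(score)))      # -> 1
```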
To ease the above limitation, some methods combine feature extraction and deep learning architec-
tures. The combined models integrate the merit of feature-based and deep learning models. Therefore,
feature extraction is generally considered as one layer for text detection in the images. The main ad-
vantage of these models is that they work well with a few samples and they are more generic and
suitable to be used for different datasets and applications. The logic and steps involved in all three
categories are shown in Fig. 2. At the same time, the sample results of text detection for mining from
different cases and situations are shown in Fig. 3, where for each query text, the corresponding method
finds text in the images.
4.1.1 Feature Extraction for Text Detection
The recent methods that use handcrafted features and conventional classifiers for text detection in natural
scene images are listed in Table 1, where the scope and objective, strengths and weaknesses of the
methods are also presented. From Table 1, it is noted that the methods in the literature have addressed
almost all the challenges of text detection in natural scene images. At the same time, the strengths of
the methods indicate that the methods use different characteristics of character and text for separating
text and non-text pixels before detecting text in the images. Moreover, we observed that most of the
methods are sensitive to poor quality images. When the images are of poor quality, there are chances
of loss of character shapes and hence text and non-text pixels in the images are classified poorly.
Fig. 2. The pipeline of text mining from scene images based on conventional Features (F), Learning (L)
and the combination of both (F + L), covering text candidates/segmentation, bounding box fixing and mining.
Table 1. Analysis of feature extraction based methods for text detection in natural scene images.
Method | Objective | Strength | Weakness
Francis et al. (2020) | Text detection | Simple least-squares SVM | Use of Otsu thresholding
Roy et al. (2020) | Text detection from multi-views | Delaunay triangulation | Limited to two views
Raghunandan et al. (2019) | Multi-script-oriented text detection | Mutual nearest neighbour concept | Sensitive to poor quality images
Guo et al. (2020) | Traffic and text detection | Exploring color features | Sensitive to lighting conditions
Liu et al. (2020) | Text detection in natural scene images | Exploring morphological component analysis | Performance depends on the size of the sliding window
Panhwar et al. (2019) | Signboard detection | Artificial neural network | Sensitive to arbitrary orientation
Khan et al. (2019) | Text detection in both natural scene and document images | Exploring maximally stable extremal regions | Sensitive to low contrast and poor quality images
4.1.2 Learning based Approaches for Text Detection
For the past few years, many methods have been developed using the machine learning concept for
text detection in natural scene images. As can be noted from Table 2, the models are more accurate
for text detection in natural scene images compared to the conventional methods. The models used
different architectures and combined several architectures to address the challenges of text detection.
The performances of the methods in this category highly depend on the number of samples, especially
non-text samples, and relevant samples. The last column of Table 2 presents the different weaknesses
of the methods, such as being computationally expensive and producing a high number of false positives.
The high number of false positives indicates that collecting and annotating relevant non-text regions is
not easy or is sometimes time-consuming.
Fig. 3. Examples of text mining from natural scene images using non-spotting based methods (sample
query text and spotted text from the SVT, MSRA and ICDAR 2015 datasets).
Table 2. Analysis of learning based methods for text detection in natural scene images.
Method | Objective | Strength | Weakness
Tursun et al. (2020) | Scene text detection and erasing | Mask based text inpainting network | The main focus is on inpainting
Wang et al. (2020) | Scene text detection | Quadrilateral region proposal network | Not robust to curved text detection
Bonechi et al. (2020) | Text detection with a small dataset | Weakly supervised learning approach | The architecture is tailored to a particular task
Zhu et al. (2020) | Scene text detection | Text center and border probability based network | Sensitive to small-sized text
Cai et al. (2020) | Robust scene text detection | Hierarchical supervision module with inside-to-outside supervision network | Computationally expensive and sensitive to arbitrarily shaped text
Zheng et al. (2020) | Robust scene text detection | Multi-scale context features based network | Sensitive to short text
Liu et al. (2020) | Scene text detection with fewer samples | Inductive and transductive semi-supervised network | Poor performance for dense text in the images
Ma et al. (2020) | Arbitrarily shaped scene text detection | Text primitives based graph convolutional network | Vulnerable to false positives
Huang et al. (2019) | Scene text detection | Fine-grained attention mask based network | Vulnerable to false positives
Dai et al. (2020) | Curved scene text detection | Multi-scale context aware feature aggregation based network | Sensitive to low contrast text
Qin et al. (2019) | Curved scene text detection | Semi- and weakly supervised learning based approach | Does not work for dense text images
Liu et al. (2020) | Arbitrarily shaped scene text detection | Mask tightness based network | Vulnerable to false positives
Chandio et al. (2019) | Multi-lingual scene text detection | Fast R-CNN network based approach | Not effective for font size variations
Xu et al. (2019) | Irregular scene text detection | Learning a deep direction field | Sensitive to large character spacing
Arafat et al. (2020) | Urdu scene text detection | Faster R-CNN based approach | Limited to a particular language text
Xiao et al. (2019) | Multi-oriented and multi-language scene text detection | Text context aware scene based network | Computationally expensive
4.1.3 Combination of Features and Learning based Approaches for Text Detection
To integrate the strength of feature extraction and deep learning models in a single method, a few
methods have been developed by combining both features and deep learning architectures, as listed in
Table 3. The feature based methods obtain dominant information that represents text and can cope
with the challenges of text detection, and then the deep learning models use the dominant information along
with the input image information to achieve better results. The methods are capable of handling
complex situations and do not require a large number of samples to obtain accurate results. However,
the methods are generally complex and computationally expensive compared to individual feature
based and learning based methods.
Table 3. Analysis of the combination of feature extraction and learning based methods for text detection
in natural scene images.
Method | Objective | Strength | Weakness
Saha et al. (2020) | Multi-lingual scene text detection | Maximally stable extremal regions and stroke width transform with a generative adversarial network | Computationally expensive and limited to a particular language
Xue et al. (2020) | Arbitrarily-oriented low light scene text detection | Maximally stable extremal regions and the cloud of line distribution with a convolutional neural network | Not robust to images of varying quality
4.1.4 Limitations of the Methods
Despite powerful methods in the literature for text detection in the images, these methods bear several
limitations. In the text detection methods based on words, as long as clear structures or shapes of all
characters are available, the methods perform well for text detection. For example, if a word contains a
few characters, these methods can define the relationship between characters based on context features
and spatial relationships. When the number of detected characters in a word is small, these methods lose
discriminative power. From the literature, it is evident that most of the methods have used deep learning
for achieving better results. Indeed, deep learning based methods work well when they are trained with
a large number of samples. At the same time, the feature based methods use handcrafted features to
avoid dependency on a large number of samples. However, the feature based methods may not achieve
high text detection results compared to deep learning based methods. To overcome this problem, a
combination of feature based and machine learning approaches has been proposed in the literature.
But the question is how to decide which part of the problem should be handled by the feature extraction
part and which by the deep learning part. In addition, one can still expect some dependency between the
features and the deep learning models, which introduces redundancy in feature extraction. In this
situation, the trade-off between feature and deep learning models, and how to balance both, is an
important factor for consideration.
In complex situations, it is necessary to design a substantial text detection model with a complex struc-
ture, which may be computationally expensive. However, the question is how to make it computationally
efficient without compromising the results and accuracy. Furthermore, how can we design and
develop such models for real-time applications? Answering these questions leads to a trade-off between
the results and design and the results and efficiency. Therefore, there is a scope for improvement and
inventing new ideas to make text detection methods robust and generic without losing results and effi-
ciency.
4.1.5 Summary
This section focuses on discussing the non-spotting text detection methods in natural scene images
based on handcrafted features, deep learning and the combination of both feature and deep learning.
The analysis of each category for text detection in different contexts and situations is discussed in this
section. Their advantages and disadvantages according to applications and different situations are also
explained. Scene text images do not provide temporal information for improving the detection results.
Therefore, their applications are limited only to scene text images. For example, it is not possible to
trace the text or objects in a series of images and it does not help to identify the action in the images.
When this temporal information is missing, there is no simple solution to restore the missing information.
This is the motivation to propose the methods for text detection in videos in the literature, which will be
discussed in the subsequent section.
4.2 Methods for Non-Spotting in Video Images
As mentioned in the previous section, the applications for text detection in videos are different from text
detection in natural scene images. For example, action recognition, event identification, tracing, surveil-
lance and monitoring are some of the applications of text detection in videos, where the methods should
use temporal information. The main advantage of these methods is the use of temporal information to
estimate motion, predict and restore missing information. The methods in this domain can be catego-
rized into feature based, learning based and the combination of both feature and learning based meth-
ods, which are similar to the methods of text detection in natural scene images as demonstrated in Fig.
4. The feature-based methods usually use temporal information at each stage to enhance the fine de-
tails in the images. For example, in text candidate detection, temporal frames are used to improve the
quality of the image. Due to the low resolution and low contrast of video frames compared to natural
scene images, the missing information or shapes sometimes need to be restored. To alleviate the prob-
lem of low contrast and low resolution, most of the time, the feature based methods use temporal frames
for improving the quality of the images. In this way, the video information helps feature-based methods
for mining text in the video.
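One simple way such temporal information can be exploited for quality enhancement is per-pixel temporal median filtering over a short window of frames, which suppresses random noise around (near-)static text. The sketch below is our own minimal illustration under that assumption, not a specific surveyed method; the video file name is hypothetical.

```python
import numpy as np
import cv2

def enhance_with_temporal_median(frames):
    """Fuse a short window of consecutive frames (list of H x W x 3 arrays)
    by per-pixel temporal median to suppress noise and strengthen static text."""
    stack = np.stack(frames, axis=0).astype(np.float32)
    fused = np.median(stack, axis=0)
    return np.clip(fused, 0, 255).astype(np.uint8)

# Usage: read a window of frames around the frame of interest.
cap = cv2.VideoCapture("clip.mp4")               # hypothetical video file
window = []
for _ in range(5):
    ok, frame = cap.read()
    if not ok:
        break
    window.append(frame)
cap.release()
if window:
    enhanced = enhance_with_temporal_median(window)
    cv2.imwrite("enhanced_frame.jpg", enhanced)
```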
However, feature-based methods may not be accurate for complex situations as the success of the
method depends on the success of pre-processing steps. To alleviate this issue, learning-based meth-
ods are proposed and used for achieving better results. In all the steps of learning based methods,
temporal information is used to improve the performance of the methods. However, the learning based
methods may not be good for generalization as their performance depends on the size of the training
samples. Therefore, the hybrid methods use both the handcrafted feature and the deep learning model
to overcome the limitations of feature and learning based methods for text mining. In this case, the
output of feature extraction can be considered as input for the deep learning models. The sample results
of text spotting in the video frames at different situations are shown in Fig. 5. From the results, one can
conclude that the feature extraction and deep learning models are complementing each other to achieve
the best results in complex scenarios.
4.2.1 Feature Extraction for Text Detection in Videos
A list of the text detection methods in videos, together with their objectives, strengths and limitations, is
provided in Table 4. From Table 4, it is evident that the methods have addressed almost all the challenges of text
detection in video frames. Most of the methods find the fine details of the frames, such as edges, as
the edge is a prominent feature to represent text in the images or video frames. Due to the low contrast
and low resolution of videos, these methods may not be robust to small fonts and poor quality images.
Although the methods use temporal information for improving the quality of the frames, these methods
still lose edges and hence fail to extract the shape or structure of characters. Moreover, in the case of
videos, a frame can have two different types of texts, namely, caption and scene text. Both types have
different characteristics and nature. Therefore, it is not easy to use a feature, which can work for both
types of texts.
Fig. 4. The pipeline of text mining from video based on Features (F), Learning (L) and the combination of
both (F + L), where temporal information supports text candidate detection, segmentation, bounding box
fixing and mining.
Table 4. Analysis of the feature extraction based method for text detection in video images.
Method | Objective | Strength | Weakness
Putro et al. (2019) | Real-time text detection | Edge detection and clustering based approach | The features are not robust to frame quality
Raghunandan et al. (2019) | Multi-script-oriented text detection | Convex hull and deficiency and clustering based approach | Sensitive to very small fonts and poor quality images
Youngiu et al. (2019) | Video text detection | Edge features based approach | Sensitive to parameters and thresholds
4.2.2 Learning based Approaches for Text Detection in Videos
To obtain more accurate text detection results, learning based methods using temporal information
have been proposed in the literature. Table 5 demonstrates some methods that used the deep learning
approach for text detection in videos. Despite the use of different architectures in the literature, the
methods are sensitive to false positives. This is true as defining non-text and finding relevant samples
is harder than finding text regions. Therefore, though they use temporal information, there is a high
chance of producing more false positives in the machine learning based methods.
Table 5. Analysis of learning based methods for text detection in video images.
Method | Objective | Strength | Weakness
Nag et al. (2019) | Marathon bib and jersey number detection | Deep CNN is explored | Limited to marathon and sports video
Song et al. (2019) | Video text frame detection | Use of a Text Siamese Network | More false positives for complex background images
Wang et al. (2019) | Video text detection | Hierarchically exploits low-level features through CNN | The scope is limited to frame detection but not text detection
Yan et al. (2020) | Subtitle detection in video | Connectionist text proposal network | Not robust enough to achieve good results
Yu et al. (2019) | Video text detection | Use of Convolutional LSTM | Computationally expensive
Zhou et al. (2019) | Video text detection | YOLO architecture is explored | Application oriented method
Fig. 5. Examples of text mining based on non-spotting based methods from video frames of different
datasets (query text and spotted text from the NUS, YVT, ICDAR 2013 and ICDAR 2015 datasets).
4.2.3 Combination of Features and Learning based Approaches for Text Detection in Videos
To make the methods robust for text detection in videos, a combination of handcrafted features and
deep learning architecture has been proposed in the literature. Table 6 presents these methods that
use feature extraction and deep learning models differently to obtain the best text detection results.
These methods, however, fail to address the challenges of the small font and non-uniform illumination
effect. When the methods use the combination of features and learning based approaches, the deep
learning models are used as a classifier but not as a feature extractor. Thus, these methods are com-
putationally more efficient compared to fully deep learning based methods.
Table 6. Analysis of the combination of feature and learning based methods for text detection in
video images.
Method | Objective | Strength | Weakness
Fassold et al. (2019) | Real-time text detection | Features for preprocessing and YOLO for detection | Sensitive to the number of temporal frames
Nag et al. (2020) | Text of marathon and sports video | Combination of gradient magnitude and direction along with CNN | Not robust to occlusion, blur and very small fonts
Rasheed et al. (2019) | Turkish text detection | Deep convolutional neural network | Sensitive to scaling
Guo et al. (2020) | Traffic and text detection | Exploring color features | Sensitive to lighting conditions
4.2.4 Limitations of the Methods
There are two major limitations of non-spotting text detection methods in video frames that lead to poor
detection results. The first problem is the poor handling of the different nature of the two types of text:
caption and scene. Since the nature of scene text is unpredictable and the nature of the caption text is
predictable, it is hard to extract features that work well for both texts. One way to resolve this issue is
to apply a text classification method to classify the caption and scene texts in the video in order to
improve the final text detection results. The second problem is determining the number of temporal
frames for operations. Most of the methods assume the number of temporal frames for the operations.
When the complexity of the problem changes, this constraint may not work well. Thus, it is necessary to
find appropriate ways to determine the number of frames automatically according to the situation.
4.2.5 Summary
Like text detection from images, video text detection methods fall into three broad categories, namely,
feature based, learning based, and the combination of features and deep learning models.
Since videos usually suffer from low resolution and contrast, text detection methods generally use tem-
poral information to enhance the quality of the frames and restore missing information in order to obtain
higher text detection accuracies. Moreover, deep learning models use temporal frames (either all or a
number of them) as additional training information to generate more generic models and achieve more
accurate text detection results. Due to processing quite a large number of temporal frames, these meth-
ods need more computational power. This need for a high-performance computing machine causes a
serious issue with video text detection based methods when the architecture of the systems becomes
complex. Moreover, there is a need for finding a criterion to determine the optimal number of temporal
frames to be used automatically according to the problem complexity.
5 SPOTTING BASED MINING APPROACHES
The methods discussed in the previous section try to separate text and non-text and extract the entire text
content from natural scene images and videos. Although text helps us to derive meaningful information
from the scene and video, it lacks the global meaning of the images and video and the extracted text
may not be representative of images or videos. These methods are also computationally expensive.
This has motivated researchers to develop methods for spotting text in natural scene images and
videos. The spotted text provides a global meaning representing the whole image and video. These
methods are more efficient and accurate compared to the text detection methods especially for retriev-
ing information from a large pool of data. This section discusses the word spotting methods for mining
text in natural scene images and video frames. To keep the consistency of the presentation, methods
in this group are categorized into feature extraction and learning based methods.
5.1 Word Spotting Methods in Natural Scene Images
Text spotting in natural scene images commonly involves two stages: i) text detection and ii) text recog-
nition. There are two categories of approaches, including conventional and end-to-end, for text spotting
in the literature (Hui et al., 2017; Song et al., 2019). The conventional approach comprises a general
pipeline with a text detector module to initially localize the text in a scene image followed by a text
recognizer module to recognize the detected text. The end-to-end text spotting methods can simulta-
neously detect text positions and recognize them. This is in line with the human reading skill, which
performs text detection and recognition in a single shot (Song et al., 2019).
Considering the basic components of a text, text detection methods in the literature (Liu et al., 2018;
Song et al., 2019) can be classified into four categories: character-based, word-based, text-line based,
and fine-scale text proposal based approaches. In character-based methods, individual characters are
initially detected and then they are concatenated to obtain words and text lines using several post-
processing steps, including character filtering and reorganization. Character-based text detection meth-
ods can further be categorized into Connected Components (CC) based and sliding-window based
methods. In CC-based methods, as the most conventional approach of text detection in images, char-
acters are detected by grouping the pixels of similar characteristics, such as color, and intensity to
identify CCs, and then analyzing the properties of the extracted CCs to detect characters among the
set of CCs. The detected characters are then grouped to construct words or text lines. In sliding-window
(region) based methods, different window slides and local features are used to localize characters from
input images (Zamberletti et al., 2015). In word-based text detection methods, words are considered as
different objects, and therefore, these methods are categorized as general object detection methods.
These methods detect word bounding boxes from a large number of word proposals by applying a
filtering strategy based on confidence scores obtained from a trained classifier. To obtain accurate text
bounding boxes, the filtered text proposals will finally be regressed. In text-line based methods, text
lines are firstly detected and then each text-line is further segmented to obtain word bounding boxes.
In fine-scale text proposal methods, word or text-line proposals are initially detected and then the de-
tected text proposals are merged to form complete words or text lines (Zamberletti et al., 2015; Liu et
al., 2018; Song et al., 2019).
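The proposal-filtering step used by word-based detectors is typically realized as confidence thresholding followed by non-maximum suppression over the word proposals. The following self-contained sketch (our own illustration, with assumed thresholds, not the procedure of any particular surveyed method) shows this standard filtering.

```python
import numpy as np

def nms(boxes, scores, score_thresh=0.5, iou_thresh=0.4):
    """Filter word proposals: keep high-confidence boxes and suppress
    overlapping, lower-scored ones. boxes: (N, 4) as x1, y1, x2, y2."""
    keep_mask = scores >= score_thresh
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = scores.argsort()[::-1]          # highest confidence first
    kept = []
    while order.size > 0:
        i = order[0]
        kept.append(i)
        # Intersection-over-union of the best box with the remaining ones.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter + 1e-6)
        order = order[1:][iou < iou_thresh]  # drop overlapping proposals
    return boxes[kept], scores[kept]

# Example: two overlapping proposals for the same word and one low-confidence one.
props = np.array([[10, 10, 60, 30], [12, 11, 62, 31], [200, 50, 230, 70]], float)
conf = np.array([0.9, 0.8, 0.3])
print(nms(props, conf)[0])                   # only the first box survives
```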
The purpose of the text recognition stage is to generate human-readable character sequences (text)
from the variable-length cropped/detected text images. Text recognition methods in the literature can
be categorized into four different groups: character-based, word-based, sequence-to-label decode
based, and sequence-to-sequence based methods (Liu et al., 2018; Song et al., 2019). Character-
based text recognition methods generally consist of three steps, including character detection, and
character recognition followed by character grouping and refining misclassified characters. This ap-
proach largely depends on the results of the character detection step and therefore, accumulated errors
are the major concern in this approach. In word-based text recognition methods, each word is consid-
ered as a whole and holistic word classification is commonly performed to achieve word recognition
(Bagi et al., 2020). A dictionary of segmented words may further need to be considered in this approach.
Sequence-based methods, as an advanced and modern way of text recognition, are widely used in the
literature (Liu et al., 2018; Song et al., 2019). In the sequence-to-label category, a feature sequence is
first extracted from the input image, and then a label sequence is predicted by neural networks (gener-
ally Recurrent Neural Networks (RNN)) providing recognized characters (Liu et al., 2018; Song et al.,
2019). Sequence-to-sequence methods try to automatically obtain certain extracted CNN features and
implicitly learn a character-level language model embodied in an RNN (Liu et al., 2018).
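A typical sequence-to-label recognizer of the kind described above follows the CRNN pattern: a CNN converts the cropped word image into a feature sequence along the width axis, a recurrent layer models context, and a per-timestep classifier predicts character labels that are decoded with CTC. The sketch below is a minimal PyTorch illustration under these assumptions, not the implementation of any particular surveyed method; layer sizes and the alphabet size are our own choices.

```python
import torch
import torch.nn as nn

class TinySequenceRecognizer(nn.Module):
    """Minimal CRNN-style model: CNN -> feature sequence -> BiLSTM -> labels."""
    def __init__(self, num_classes=37):          # 26 letters + 10 digits + CTC blank
        super().__init__()
        self.cnn = nn.Sequential(                # collapses height, keeps width
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1), (2, 1)),
            nn.AdaptiveAvgPool2d((1, None)),     # height -> 1, width = sequence length
        )
        self.rnn = nn.LSTM(128, 96, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 96, num_classes)

    def forward(self, x):                        # x: (B, 1, 32, W) grayscale word crop
        f = self.cnn(x)                          # (B, 128, 1, W')
        seq = f.squeeze(2).permute(0, 2, 1)      # (B, W', 128) feature sequence
        out, _ = self.rnn(seq)                   # contextual features per timestep
        return self.fc(out)                      # (B, W', num_classes) label scores

model = TinySequenceRecognizer()
logits = model(torch.randn(2, 1, 32, 128))       # two dummy word crops
print(logits.shape)                              # torch.Size([2, 64, 37])
# Training would apply nn.CTCLoss to the log-softmax of these logits against
# the ground-truth character sequences (sequence-to-label decoding).
```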
Recently, deep learning-based approaches have become dominant in both text detection and recogni-
tion stages. For text detection, CNN-based deep learning is usually used to extract feature maps from
a scene image, and then different decoders are used to decode the regions (Tian et al., 2016). For text
recognition, a network for sequential prediction is applied to the extracted text regions (Shi et al., 2017).
When the detection and recognition stages work separately, this is time and cost consuming,
especially for images with several text regions. Moreover, the correlation in visual cues shared
in detection and recognition is not considered and the detection network cannot be supervised by labels
from text recognition, and vice versa (Liu et al., 2018).
Most text spotting methods in the literature (Liu et al., 2019), first, generate several text proposals using
a text detection model and then recognize them with a separate text recognition model (Jaderberg et
al., 2016; Gupta et al., 2016). The end-to-end text spotting methods commonly use a text proposal
generation model for text detection and a text recognition method for text spotting. Moreover, text spotting
has evolved from simple horizontal text to complicated and challenging situations, such as curved and
multi-directional text (Liu et al., 2018). It is worth noting that the earlier methods in the literature use
handcrafted features for scene text spotting. Furthermore, lexicon-free end-to-end text recognition sys-
tems have recently been proposed for scene text spotting (Liao et al., 2019). In the subsequent sub-
section, a detailed discussion on feature based methods for word spotting is provided.
5.1.1 Feature Extraction for Word Spotting
Two types of features, handcrafted-based and deep-learning-based, have been used for text spotting
in images in the literature (Zamberletti et al., 2015; Gomez et al., 2017; Jaderberg et al., 2014; Jader-
berg et al., 2016). The handcrafted features used in the literature include color channels (R, G, B),
foreground intensity, background intensity, foreground Lab color, background Lab color, spatial pyramid
levels, diameter, gradient, and stroke width (Gomez et al., 2017). Using the extracted features and a
holistic CNN classifier, a set of word proposals has been generated without an explicit character
segmentation to obtain word spotting in an end-to-end manner (Gomez et al., 2017). Moreover, a con-
ventional sliding window text detection based on Aggregate Channel Features (ACF) coupled with an
AdaBoost classifier has been used in the literature (Jaderberg et al., 2016). ACF features include nor-
malized gradient magnitude, the histogram of oriented gradients, and the raw grayscale pixel values.
Each channel C has been smoothed, divided into blocks and the pixels in each block were summed
and smoothed again to obtain aggregate channel features. It is noted that the ACF features are not
scale-invariant, so for multi-scale text detection, features at different scales (pyramid) need to be ex-
tracted (Jaderberg et al., 2016). The pyramidal histograms of characters as features have also been
used to represent word images and their textual transcriptions to enable both query-by-example and
query-by-string searches in a unified framework for word searches in handwritten and natural images.
The features are discriminative and the similarity between words is independent of the writing and font
style, illumination, and capturing angle (Almazan et al., 2014). Shape code based word matching for
spotting words in Indian multilingual documents is proposed by Tarafdar et al. (2010), where geo-
metrical features, such as extreme points, crossing counts, zonal features, loop-based features are
extracted from the input images. Similarly, the combination of rotation invariant features and SVM clas-
sifier has been used for spotting words in graphical documents and, to improve the results of spotting,
SIFT features are also used by Tarafdar et al. (2013).
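To make the aggregate channel feature computation concrete, the sketch below is our own simplified illustration (the detector in Jaderberg et al., 2016 uses a richer channel set and a boosted classifier): per-pixel channels are smoothed, summed over small blocks and smoothed again. The input file name, block size and number of orientation bins are assumptions.

```python
import numpy as np
import cv2

def aggregate_channel_features(gray, block=4, n_orient=6):
    """Simplified ACF: per-pixel channels (grayscale, gradient magnitude,
    oriented-gradient histograms) are smoothed, block-summed and smoothed again."""
    gray = gray.astype(np.float32) / 255.0
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    ang = np.arctan2(gy, gx) % np.pi             # unsigned orientation in [0, pi)
    channels = [gray, mag]
    # HOG-like channels: gradient magnitude binned by orientation.
    for k in range(n_orient):
        lo, hi = k * np.pi / n_orient, (k + 1) * np.pi / n_orient
        channels.append(mag * ((ang >= lo) & (ang < hi)))
    feats = []
    for ch in channels:
        ch = cv2.GaussianBlur(ch, (3, 3), 0)     # pre-smooth each channel
        h, w = ch.shape
        h, w = h - h % block, w - w % block
        # Sum pixels inside each block x block cell (the "aggregation" step).
        cells = ch[:h, :w].reshape(h // block, block, w // block, block).sum(axis=(1, 3))
        feats.append(cv2.GaussianBlur(cells, (3, 3), 0))
    return np.stack(feats, axis=0)               # (n_channels, H/block, W/block)

gray = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input image
if gray is not None:
    print(aggregate_channel_features(gray).shape)
```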
Augmented multi-resolution maximally stable extremal regions and convolutional neural networks have
further been employed for text spotting from scene images (Zamberletti et al., 2015). Using simple and
fast geometric transformations on multi-resolution proposals and character augmentation without con-
sidering deep architectures and a large amount of training data provided high text detection rates in
scene images (Zamberletti et al., 2015). Moreover, Pyramid Histogram of Oriented Gradient (PHOG)
features and Zernike moments have been employed in different stages of the proposed two-stage Hid-
den Markov Model (HMM) based framework for keyword detection in video frame/scene images of
multiple scripts. The features have been extracted using a sliding window passed on the binarized text
lines segmented from the scene image/video frames. To improve the performance of the proposed word
spotting framework, a dynamic shape coding using contextual information extracted by adding time
derivatives from the neighbouring windows has further been used in the literature (Roy et al., 2019).
Different convolutional deep learning neural network based methods have recently been used as fea-
ture backbone to extract features in order to appropriately handle the text of different scales (Gao et al.,
2019; Qin et al., 2019). Features have been extracted by using the output of one or more of the hidden
layers in CNN (Gao et al., 2019; Qin et al., 2019). Sharing features extracted from CNN has also been
used to extend a character classification method to character detection and bigram classification. A rich
feature set generated by training a strongly supervised character classifier and the intermediate hidden
layers have further been considered as features for text detection, character classification, and bigram
classification (Jaderberg et al., 2014). This method leverages the convolutional structure of a CNN to
process the entire image in a single pass and generate all the features required to detect word bounding
boxes, and then to recognize detected words from a fixed lexicon using the Viterbi algorithm (Jaderberg
et al., 2014).
Moreover, edge boxes have been used in the literature to obtain text word bounding box proposals as
several collections of characters with sharp boundaries (Jaderberg et al., 2016). A region-based feature
extraction using Region-of-Interest (RoI) pooling layer has also been used to generate feature maps
with varying lengths. An RNN encoder has then been employed to encode feature maps of different
lengths into the same size (Hui et al., 2017). A bottom-up method for keyword spotting in multi-oriented
Chinese scene text has been presented by Wu et al. (2018). The method is based on the single-shot
object detection (SSD) method and detects characters and looks for the keywords by considering the
context and relationship between distance and scale of each character pair in the image (Wu et al.,
2018).
5.1.2 Learning based Methods for Word Spotting
Learning based methods for text spotting can be divided into two different categories: conventional
machine learning, and deep learning based approaches. The conventional machine learning based
methods have longer history in the literature of word spotting compared to the deep learning meth-
ods (Gao et al., 2019), whereas deep learning based methods are more advanced and have recently attracted
many researchers (Jaderberg et al., 2014; Bazazian et al., 2018).
From the first category of methods, a two-stage word spotting approach based on HMM has been pre-
sented in the literature to detect keywords in multi-script text lines extracted from natural scene images
and video frames (Roy et al., 2019). A script identification step has been employed to identify the script of
the line. An unsupervised dynamic shape coding based approach has then been used to group similar shape
characters to improve the performance. Next, the hypothesis locations have been verified to improve
retrieval performance. The proposed system has been evaluated by searching keywords in natural
scene image and video frames of English and two popular Indic scripts (Roy et al., 2019). In another
system presented in (Almazan et al., 2014), both word images and text information have been combined
with label embedding, attribute learning, and a common subspace regression. The PHOC and
scale-invariant feature transform (SIFT) descriptors of images have been computed to characterize the
images. Word images have first been encoded into feature vectors, and these feature vectors have
been used together with the PHOC labels to learn linear SVM-based attribute models. To learn SIFT
descriptors, Gaussian Mixture Models (GMMs) have been utilized. As images and the corresponding
text strings in the images are close together, recognition and retrieval tasks can be seen as the nearest
neighbour problem. The proposed feature representation has a fixed length, is of low dimension, and is
very fast to compute (Almazan et al., 2014). This method can also be positioned within the conventional
methods.
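The PHOC representation used above can be illustrated with a short, self-contained sketch: for each pyramid level the word is split into that many regions, and a character sets the bit of a region if at least half of its normalized extent falls inside that region. This is a simplified rendering of the descriptor of Almazan et al. (2014); the alphabet and pyramid levels below are our own choices.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def phoc(word, levels=(2, 3, 4, 5)):
    """Pyramidal Histogram Of Characters: a binary attribute vector indicating
    which characters occur in which spatial region of the word."""
    word = word.lower()
    n = len(word)
    vec = []
    for L in levels:
        for r in range(L):                        # region r of level L
            region = (r / L, (r + 1) / L)
            bits = np.zeros(len(ALPHABET), dtype=np.float32)
            for i, ch in enumerate(word):
                if ch not in ALPHABET:
                    continue
                occ = (i / n, (i + 1) / n)        # normalized character extent
                overlap = min(occ[1], region[1]) - max(occ[0], region[0])
                # Character belongs to the region if at least half of it lies inside.
                if overlap >= 0.5 * (occ[1] - occ[0]):
                    bits[ALPHABET.index(ch)] = 1.0
            vec.append(bits)
    return np.concatenate(vec)                    # length = sum(levels) * |alphabet|

d = phoc("text")
print(d.shape)        # (504,) for levels 2+3+4+5 over a 36-character alphabet
# Word images embedded into the same attribute space (e.g., via SVM attribute
# classifiers) can then be matched to query strings by nearest-neighbour search.
```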
Bazazian et al. (2018) have proposed character probability maps, as an intermediate representation of
images for word spotting. The character probability maps called Soft-PHOC have been obtained based
on the extended concept of the Pyramidal Histogram Of Characters (PHOC) in combination with Fully
Convolutional Networks by computing pixel-wise mapping of the character distribution in candidate
word regions. The Soft-PHOC descriptors have been used for word spotting tasks in egocentric camera
streams using text-line proposals. The text proposals have been extracted based on the application of
Hough Transform on character probability maps and scores obtained using Dynamic Time Warping
(DTW). The benefit of this technique is that there is no need to apply complex post-processing and also
it is not necessary to generate a multi-oriented bounding box proposal with four coordinates for each
proposal. Preliminary experiments showed that detecting line proposals was simpler and more efficient
compared with bounding box proposals to detect query words in scene images (Bazazian et al., 2018).
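The DTW score mentioned above measures how well a query descriptor sequence aligns with the descriptors extracted along a candidate text-line proposal. The following generic sketch (our own illustration, assuming column-wise feature vectors; not the exact scoring used by Bazazian et al., 2018) computes a length-normalized DTW cost.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping cost between two feature sequences a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)                             # length-normalized score

q = np.random.rand(20, 8)    # query word descriptor sequence (20 columns, 8-D)
c = np.random.rand(35, 8)    # candidate text-line descriptor sequence
print(dtw_distance(q, c))    # lower cost = better match
```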
Considering the second category of the methods, the ResNet-152 and the Pyramidal Histogram of
Characters (PHOC) embedding have been combined to build a script-independent multilingual word-
spotting model for Latin, Arabic, and Bangla (Indian) scripts. The proposed deep CNN (DCNN) has been
trained to deal with multilingual word-spotting as multitasking, similar to detecting text in the wild by a
human being. The results obtained from the system indicated that only one deep learning model can
be used to design a script-independent multilingual word-spotting system comparable with the system
using a single model per script. The system is also able to recognize handwritten words in scene images
(Al-Rawi et al., 2019).
Among the methods categorized in the second group, Jaderberg et al. (2014) presented a method composed of two sequential tasks of detecting word regions and recognizing the words within these regions
for word spotting in natural images. These components have further been used together to form an end-
to-end text spotting system for images. A Convolutional Neural Network (CNN) classifier has been de-
signed to handle both tasks. Many layers of the proposed CNN architecture have also been used as
features for text detection, character recognition, and bigram classification (Jaderberg et al., 2014). The
results obtained from the system indicated the significance of jointly learning features to build multiple
strong classifiers (Jaderberg et al., 2014). Jaderberg et al. (2016) have used the same pipeline to first
extract region proposals for text detection. Proposals have then been filtered using a random forest
classifier to reduce the number of false-positive detections. Deep CNNs have been designed to refine
proposals based on bounding box regression and perform word recognition on each refined region
proposal at the same time. Detection and recognition results have been merged and assigned a score
to each text proposal so that thresholding can be performed on the detection results to obtain the final text spotting results (Jaderberg et al., 2016). This pipeline ensured high recall, and the fast subsequent filtering stage improved precision. The CNNs have been trained solely on data
produced by a synthetic text generation engine, requiring no human-labeled data (Jaderberg et al.,
2016). This system is fast and scalable, as datasets of millions of images can be used for instant text-based image retrieval without any perceivable degradation in accuracy. Additionally, the recognition model has been trained purely on synthetic data, which allows the system to be easily re-trained for the recognition of other languages or scripts without the need for any human-labeled data (Jaderberg et
al., 2016). Augmented multi-resolution maximally stable extremal regions and CNNs have further been
used for text spotting from scene images. Moreover, text character proposals have been augmented to maximize text detection rates while using relatively shallow architectures and a small amount of training data.
Simple and fast geometric transformations on multi-resolution proposals have finally been used as de-
scriptors to detect text characters (Zamberletti et al., 2015).
Unlike the methods that deal with the problem of text spotting considering the text detection and text
recognition separately, recent deep learning based methods try to integrate the detection and recogni-
tion stages with an end-to-end trainable neural network to get the advantages of the complementarity
of text detection and recognition in a single framework. The method presented in (Hui et al., 2017) is
among the first attempts that used such a concept. In (Hui et al., 2017), a unified framework based on
a text proposal network, Recurrent-CNN (R-CNN), and Long Short-Term Memory (LSTM) has been proposed to simultaneously localize and recognize text with a single forward pass, avoiding image
cropping, feature re-computation, word separation, and character grouping. The framework has been
trained end-to-end, using images, ground-truth bounding boxes and text labels to obtain convolutional
features and use them for both detection and recognition purposes. This multi-task training saves pro-
cessing time and the learned features become more informative, improving overall performance (Hui et
al., 2017).
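The idea of sharing convolutional features between detection and recognition heads can be illustrated with the following toy PyTorch sketch; the layer sizes and heads are placeholders and do not reproduce the actual architecture of (Hui et al., 2017).

```python
import torch
import torch.nn as nn

class SharedTextSpotter(nn.Module):
    """Toy network sharing one convolutional trunk between a text/non-text
    detection head and a per-pixel character classification head."""

    def __init__(self, num_classes=37):   # e.g. 26 letters + 10 digits + background
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.det_head = nn.Conv2d(64, 1, 1)            # per-pixel text score
        self.rec_head = nn.Conv2d(64, num_classes, 1)  # per-pixel character logits

    def forward(self, images):
        feats = self.backbone(images)                  # features computed once
        return self.det_head(feats), self.rec_head(feats)

# Training would combine a detection loss and a recognition loss over the
# shared features, so both tasks shape the same representation.
```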
In a recent work, Bazazian et al. (2018a) have designed a fully convolutional network to generate character attribute heatmaps for all characters. A rectangle classifier has been used to fuse text proposals and heatmaps to detect the most likely rectangle for the query word in scene images. The method can handle the problem of unconstrained word spotting in scene images (Bazazian et al., 2018a). Liu et al. (2018) have
also performed text spotting on the oriented text in an end-to-end fashion applying text detection and
recognition simultaneously using a Fast Oriented Text Spotting (FOTS) network. The method has been
built using CNN, which learns and shares features for text detection and recognition. The joint training
method has provided better performance compared to two-stage methods (Liu et al., 2018).
An end-to-end trainable framework called Word Segmentation Guided Characters Aggregation Net
(WAC-Net) has further been developed to spot arbitrarily shaped text of different scripts in scene images (Gao et al., 2019). A shared convolutional backbone, a word-level instance-aware segmentation network (WSN), and a character-level detection and recognition network (CDRN) work together to spot text in a single forward pass. The WSN and CDRN are jointly trained by multi-task learning (Gao et al., 2019). Moreover, a trainable neural network called Mask TextSpotter has been presented
to achieve both detection and recognition of multi-script text instances of irregular shapes directly from two-dimensional space via semantic segmentation. In addition, a spatial attention module has been used to enhance the performance and generality of the end-to-end text spotting approach (Liao et al., 2019). An end-to-end trainable network based on instance segmentation has also been proposed to
simultaneously detect and recognize the text of arbitrary shapes in scene images. An attention model
has further been considered to decode the textual content of each arbitrary shape text region. A simple
RoI masking has finally been employed to extract features from arbitrary shape text regions.
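The RoI masking step just mentioned can be illustrated by the small sketch below, in which features are pooled only over the pixels of an arbitrarily shaped text mask; this is a simplified illustration rather than the exact operation used in the cited work.

```python
import numpy as np

def roi_mask_pool(feature_map, region_mask):
    """Average-pool convolutional features over an arbitrarily shaped text region.

    feature_map: (C, H, W) array of convolutional features
    region_mask: (H, W) binary mask of the detected text region
    Returns a (C,) descriptor that can be fed to the recognition branch.
    """
    mask = region_mask.astype(bool)
    if not mask.any():                        # empty region: return a zero descriptor
        return np.zeros(feature_map.shape[0])
    masked = feature_map[:, mask]             # keep only features inside the region
    return masked.mean(axis=1)
```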
To avoid feature refinement between the detector and the recognizer, and to directly feed features extracted from the detected text instances to the decoder, the outputs of an existing OCR engine have been used as weakly labeled data to train the recognition model, improving both the detection and recognition accuracies (Qin et al., 2019). Song et al. (2019) have further proposed a combination of convolu-
tional and recurrent neural networks by sharing a convolutional feature map to address scene text de-
tection and recognition at the same time. The text has been detected and recognized in a single forward propagation to eliminate redundant processes, such as image patch cropping and repeatedly computing feature maps. The unified neural network has been trained using images, ground-truth bounding boxes, and text labels, and promising performance in terms of computation time and accuracy has
been achieved without applying complicated post-processing steps (Song et al., 2019). Zhou et al.
(2019) have also presented another end-to-end deep neural network model called Multi-Language Scene
Text Spotter (MLTS) for multi-language scene text detection, recognition and script identification. A
special backbone for text features and two different types of attention have been considered to achieve
state-of-the-art performance for both text spotting and script identification in natural images (Zhou et
al., 2019). Recently, Bagi et al. (2020a) have proposed an end-to-end trainable deep neural network
based on local, global and contextual information of multi-scale feature maps of a lightweight backbone
network for spotting text instances in scene images with background clutter, partially occluded text,
truncation artifacts, and perspective distortions. The problem of inter-class misclassification has been
addressed by maximizing inter-class separability and compacting intra-class variability using Gaussian
softmax. Multi-language character segmentation and word-level recognition have also been incorpo-
rated into the system. The proposed text spotting method provided high accuracies for detecting multi-
lingual text, logos, and symbols in scene images with the cluttered background environment captured
from resource-constrained devices, such as smartphones (Bagi et al., 2020). Furthermore, Liu et al.
(2020a)have introduced an end-to-end trainable unified framework for arbitrary shape text spotting by
integrating holistic-, pixel- and sequence level semantic information into the system. The Mask R-CNN
has been customized to obtain both holistic- and pixel-level semantics for text recognition. The two-
dimensional feature maps extracted from the text spotting task have been fed into an additional text
recognition branch. One-dimensional sequence-level semantics extracted based on an attention-based
sequence-to-sequence network has also been used for text recognition. Finally, the results obtained
from all three levels of semantics have been combined to achieve high accuracies in text recognition
and spotting. Besides, the wide descriptions of texts obtained from the framework enabled the system
to use only word-level weakly annotated data for training a model for robust text spotting (Liu et al.,
2020a). A bottom-up approach for text spotting in scene images was also developed by Fan et al.
(2020). A character detector based on an Extremal Region (ER) detector and an Aggregate Channel
Feature (ACF) detector has been proposed to first detect character candidates with high recall rates.
The real character proposals have then been determined using a CNN filter for high character detection
precision. A hierarchical clustering algorithm, which combines multiple visual and geometrical features,
has finally been designed to group characters into word proposals for word recognition (Fan et al.,
2020). A bottom-up approach for keyword spotting and context extraction in multi-oriented Chinese scene images has further been presented. The proposed approach includes character detection, keyword spotting, and context extraction, in which character-level text detection and recognition are performed simultaneously using an SSD network. Furthermore, the geometric relationship between keywords and their context has been analyzed to spot the keywords. Finally, the context extractor has filtered out wrong keywords and produced the context of the keywords according to their geometric locations (Wu et al., 2018).
An Adaptive Bezier-Curve Network (ABCNet) has further been proposed, for the first time, to fit oriented
or curved text by a parameterized Bezier curve. A BezierAlign layer has also been designed for extract-
ing accurate convolution features of text instances that significantly improved the precision measure.
Compared with standard bounding box detection, the Bezier curve detection has a negligible computation overhead. Moreover, the method can handle text spotting and recognition efficiently and accurately compared to state-of-the-art methods, and it is also 10 times faster than recent state-of-the-art methods (Liu et al., 2020b).
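To illustrate why a parameterized Bezier curve is a compact way to describe curved text boundaries, the sketch below samples points on a cubic Bezier curve from its four control points; this is a generic Bezier evaluation, not the ABCNet implementation.

```python
import numpy as np

def cubic_bezier(control_points, num_samples=20):
    """Sample points along a cubic Bezier curve defined by 4 control points.

    control_points: (4, 2) array of (x, y) control points for one text boundary
    Returns a (num_samples, 2) array of points tracing the curved boundary.
    """
    p0, p1, p2, p3 = np.asarray(control_points, dtype=float)
    t = np.linspace(0.0, 1.0, num_samples)[:, None]
    return ((1 - t) ** 3 * p0
            + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2
            + t ** 3 * p3)

# A curved text instance can be described by two such curves (top and bottom
# boundaries); sampling both yields the grid used to rectify the text region.
```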
To efficiently handle text spotting in blurry scene images, Bagi and Dutta (2020b) have proposed a text spotter called Blurred TextSpotter. An encoder-decoder backbone network based on multi-scale contextual information, followed by spatial and channel-wise attention, has been considered in the Blurred TextSpotter. Text masks have been accurately detected and classified using a hardware-efficient recognition module (Bagi & Dutta, 2020b). Different datasets have been used to evaluate the word spotting methods introduced in the literature.
Visual context information has further been used by Sabir et al. (2020) to train/tune and evaluate existing semantic similarity-based text spotting baselines for re-ranking the produced text hypotheses, resulting in improved text spotting accuracy. A visual context dataset has been introduced for text spotting in the wild by adding information, such as a textual image description (caption) and the names and locations of objects in the image, to the scene images of the publicly available COCO-Text dataset. This enables researchers to use semantic relations between texts and scenes in their text
spotting systems (Sabir et al., 2020).
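The general idea of re-ranking recognition hypotheses with visual context can be sketched as follows; the embeddings, scores, and weighting scheme are hypothetical placeholders rather than the formulation used by Sabir et al. (2020).

```python
import numpy as np

def rerank_hypotheses(hypotheses, scene_object_vecs, alpha=0.5):
    """Re-rank text hypotheses using visual context.

    hypotheses: list of (word, recognition_score, word_vec) tuples, where
                word_vec is a semantic embedding of the candidate word
    scene_object_vecs: (k, d) embeddings of objects/caption words in the scene
    alpha: weight balancing recognition confidence and context similarity
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    rescored = []
    for word, rec_score, vec in hypotheses:
        context = max(cos(vec, obj) for obj in scene_object_vecs)
        rescored.append((alpha * rec_score + (1 - alpha) * context, word))
    return sorted(rescored, reverse=True)   # best combined score first
```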
As text-line based text spotting methods are unable to handle arbitrary Chinese text (text-lines) in scene
images, a character-based framework composed of three modules, including character detection, char-
acter recognition, and character grouping, has been proposed in the literature to spot Chinese text in scene images (Song et al., 2019). A Conditional Random Field (CRF) based character grouping algorithm has been used to arrange arbitrary Chinese text. The proposed framework achieved superior performance compared with state-of-the-art text-line based methods when applied to the ReCTS-ARB549 dataset (Song
et al., 2019).
Recently, a pipeline of text spotting composed of text detection and recognition has been proposed to
perform text spotting in natural scene images containing complicated backgrounds, various fonts,
shapes, and orientations (Wang et al., 2020). The text detection component, called UNet, Heatmap, and Textfill (UHT), used a UNet to compute heatmaps for candidate text regions and a Textfill algorithm to produce a polygon-based bounding box for each word in the candidate text region. The UNet has
been trained with ground-truth heatmaps. The proposed text spotting framework, called UHTA, has
been designed by concatenating the UHT and a state-of-the-art text recognition system. The system
has been applied on four public scene-text-detection datasets, including Total-Text, SCUT-CTW1500,
MSRA-TD500, and COCO-Text, and the results indicated the effectiveness of the UHT in detecting
multilingual and curved text in scene images (Wang et al., 2020). The proposed method is, however,
complex and needs to tune many parameters. As can be seen from the literature, different datasets
have been used to evaluate the above-mentioned methods. To get an idea of the other datasets used
for evaluation in the literature of this domain, a list of datasets is presented in Table 7. As shown in
Table 7, most of them are various versions of the ICDAR datasets used for the evaluation of text spotting
methods in several ICDAR competitions.
Table 7. Datasets in the literature for text recognition and spotting (Liu et al., 2020b)

Name        | Description                                         | Dictionary | Bounding boxes | Text instances
ICDAR17-T2  | ICDAR 2017 Task 2: from the COCO-Text dataset       | -          | 46K            | -
ICDAR13     | ICDAR 2013                                          | -          | 1K             | -
SVT         | Street View Text                                    | -          | 647            | -
Synth90K    | A synthetic dataset with a dictionary of 90K words  | 90K        | 9M             | -
ICDAR17-V   | Image+Textual dataset from ICDAR17 Task-3           | -          | 10K            | 25K
COCO-Text-V | Image+Textual dataset from COCO-Text                | -          | 16K            | 60K
COCO-Paris  | Only Textual dataset from COCO-Text                 | -          | -              | 158K
5.1.3 Summary
From the text spotting methods discussed in the above sections, it is noted that these methods have
already addressed different challenges of word spotting in natural scene images. However, these meth-
ods are mostly limited to scene images and most of them may not be suitable for videos. In the case of
videos, the methods proposed for natural scene images are, firstly, not capable of using temporal information to improve the performance of text spotting. Secondly, they become expensive as they have to deal with many temporal frames to perform text spotting. Therefore, specific
methods should be developed for word spotting in videos by exploring temporal information. The fol-
lowing section, therefore, focuses on feature extraction and learning based methods for word spotting
in the video frames.
5.2 Methods for Word Spotting in Video Images
As the volume of generated videos increases every day, the automatic retrieval of videos based on their content is necessary to reduce the time spent on manual indexing of such a huge number of videos. Text
or word spotting in videos, including lecture videos, has received much less attention in the literature compared to
text spotting in scene images (Dutta et al., 2018; Jha et al., 2018). There are various types of texts in
videos, such as running text, and arbitrary text. Running texts in broadcast videos generally appear
horizontally at fixed positions with good contrast and little variation. These characteristics make text
detection in broadcast videos comparatively easier compared to other types of text in videos (Dutta et
al., 2018). However, despite achieving highly accurate recognition rates from text images, not many
text spotting methods for lecture videos have been reported in the literature (Dutta et al., 2018). Lecture
videos are rich in textual information, and understanding this textual information can help with better
video understanding and retrieval. Text detection and recognition in presentation slides have been per-
formed to match the slides with the lecture videos (Wang et al., 2003). Edge detection and geometry
based methods followed by a commercial OCR have been used for text spotting (Wang et al., 2003).
Keyword search and video indexing have also been performed using off-the-shelf OCR systems (Tuna
et al., 2011). Furthermore, the combination of Automatic Speech Recognition and OCR methods has
been employed on lecture videos to extract keywords for video text spotting (Yang et al., 2014).
In addition, a recognition-free pipeline for video retrieval has been proposed to retrieve silent speech
videos containing a queried word in the form of a video clip. The method uses video segmentation to
obtain a set of word proposal clips. A similarity measure and threshold have then been used to decide
if a ‘word proposal clip’ contained the spotted word. A query expansion technique using pseudo-relevance
feedback and a re-ranking method based on correlation maximization has also been proposed in the
system (Jha et al., 2018). To review the state-of-the-art text spotting methods in videos, in line with the
previous sections three categories, feature, learning based approaches and the combination of both
approaches are considered in the next subsections.
5.2.1 Feature Extraction Methods for Word Spotting
Local appearance and global structure information of characters have been considered for character
recognition in video frames. Part-based tree structures have been used to model each category of
characters to detect and recognize characters simultaneously. The HOG features, as the local appear-
ance descriptor, and color channels with the largest gradient magnitude for color images have been
used in the literature for video and scene text recognition (Benabdelaziz et al., 2020). Structure-guided character detection and linguistic knowledge have also been considered in the proposed system (Shi et al., 2014). For word recognition, the detection scores and language model have been combined into the posterior probability of the character sequence from the Bayesian decision view. The final word recognition result has been obtained by maximizing the character sequence posterior probability via the Viterbi algorithm (Shi et al., 2014). PHOG features have been computed from the gray and binarized
images for date spotting in natural scene images and video frames with complex backgrounds, blur
noise, and low resolution. Binary and gray image features have been combined by an MLP-based tandem approach. The proposed date spotting framework has been built using three different HMMs and ap-
plied to segmented text lines from natural scene images and video frames without segmenting charac-
ters or words (Roy et al., 2015; Roy et al., 2018).
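For illustration, a simplified PHOG computation is sketched below: gradient orientation histograms are accumulated over an increasingly fine spatial pyramid and concatenated. The pyramid levels and bin count are assumptions, and the sketch omits the tandem MLP/HMM stages of the cited framework.

```python
import numpy as np

def phog(gray, levels=(1, 2, 4), bins=8):
    """Pyramidal Histogram of Oriented Gradients for a grayscale word image.

    gray: (H, W) array; levels: grid sizes of the spatial pyramid.
    Returns the concatenated, L1-normalised orientation histograms.
    """
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), np.pi)        # orientations in [0, pi)
    H, W = gray.shape
    feats = []
    for cells in levels:
        for i in range(cells):
            for j in range(cells):
                ys = slice(i * H // cells, (i + 1) * H // cells)
                xs = slice(j * W // cells, (j + 1) * W // cells)
                hist, _ = np.histogram(ang[ys, xs], bins=bins,
                                       range=(0, np.pi), weights=mag[ys, xs])
                feats.append(hist)
    feats = np.concatenate(feats)
    return feats / (feats.sum() + 1e-8)
```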
Moreover, the combination of Texture-Spatial-Features (TSF) has been used for keyword spotting in
video images of different fonts, contrasts, backgrounds and font sizes without word recognition. The set
of texture features has been extracted to identify text candidates in the segmented word image using
the K-means clustering technique (Shivakumara et al., 2015). The combination of Radon and Fourier
coefficients has been considered to define context features based on coefficient distributions of fore-
ground and background of text candidates. Canny edges extracted from the image and minimum cost
path based ring growing have been used to restore missing text components. These features have
been extracted locally and globally for spotting words from videos, natural scene and license plate
images (Shivakumara et al., 2019). However, these feature extraction methods are generally sensitive
to noise. Moreover, they seem to be selected arbitrarily and may not provide high text spotting accuracies
on other datasets.
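The idea of separating text candidates from the background by clustering simple texture responses can be illustrated with the following sketch; the local standard-deviation feature and the two-cluster K-means are illustrative simplifications of the texture-spatial features used in (Shivakumara et al., 2015).

```python
import numpy as np
from sklearn.cluster import KMeans

def text_candidate_mask(gray, window=5):
    """Split a word image into text-candidate and background pixels by
    clustering a per-pixel texture feature (local standard deviation)."""
    H, W = gray.shape
    pad = window // 2
    padded = np.pad(gray.astype(float), pad, mode='reflect')
    feats = np.zeros((H, W))
    for i in range(H):                    # slow but clear: one window per pixel
        for j in range(W):
            feats[i, j] = padded[i:i + window, j:j + window].std()
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(feats.reshape(-1, 1))
    labels = labels.reshape(H, W)
    # assume the cluster with the higher mean texture response corresponds to text
    text_cluster = int(feats[labels == 1].mean() > feats[labels == 0].mean())
    return labels == text_cluster
```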
5.2.2 Learning based Approaches for Word Spotting
To promote research on text spotting in video lectures, a new dataset, called LectureVideoDB, com-
posed of frames from 24 different course videos, including science, management and engineering, has
been introduced in the literature (Dutta et al., 2018). The quality and resolution of videos, camera angle
and its distance from the blackboard vary in the collected videos, but the text remains the focus of the cameras.
Experimental results obtained from the existing methods in the literature indicated that these methods need to be improved for accurate text spotting in lecture videos (Dutta et al., 2018). To detect
handwritten text, math expressions and sketches in lecture videos, a deep learning based method was
applied by Kota et al. (2018). The proposed system can generate a summary of the content presented
over time in the lecture while addressing the problem of content occlusion. By employing the proposed
system, timestamp-based semantically meaningful bounding box annotations can be provided for the
handwritten whiteboard content in the AccessMath dataset (Kota et al., 2018). A method based on deep
and transfer learning has also been presented for handwritten word retrieval in the literature (Benabdelaziz et al., 2020). The visual features extracted from both deep and transfer learning methods have
been considered for retrieval experiments on the ICDAR15 word spotting dataset. Six different CNN
architectures and three distance metrics have been used for the experiments. Despite the complexity
of handwritten word spotting, deep CNNs have been tuned using transfer learning to provide efficient
word-spotting (Benabdelaziz et al., 2020). In addition, Mafla et al. (2020) have proposed a single shot
CNN architecture for scene text retrieval to obtain word bounding boxes as a compact representation
of spotted words. The problem has been modeled as a nearest neighbor search of the textual repre-
sentation of the input query over the outputs of the CNN obtained from an image database. The proposed method is fast and suitable for multilingual and real-time text spotting in videos (Mafla et al., 2020).
5.2.3 Combination of Feature and Learning based Approaches for Word Spotting
Compared to feature and learning based methods, the hybrid or combination approach for text spotting
from videos has received less attention from researchers. Recently, Mafla et al. (2020) have proposed
a real-time word spotting method based on a fully convolutional neural network to detect and recognize
text in a single pass. The PHOC descriptor has been used to universally encode the presence of a
specific character in a visual region of the proposed bounding box of a language-specific text string
using a CNN based model (YOLOv2 object detection network). The single-shot detection model has
been trained to construct the PHOC by automatically learning character attributes independently and
transferring knowledge acquired at the training phase to build PHOCs of unseen words at inference
time. The proposed PHOC version is a binary vector of size 820 dimensions constructed by concate-
nating the L2 to the L6 unigram levels along with 2 levels of the 50 most common English language
bigrams. As the proposed network uses a smaller filter size in the model’s last filter, it can perform in
real-time (Mafla et al., 2020). Moreover, using a bigger PHOC along with more unigram and bigram
levels can provide superior scene text retrieval results compared with the state of the art results on
different datasets, including multilingual datasets (Mafla et al., 2020). However, this method is lan-
guage-specific and may not be easily applicable to other scripts.
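A simplified construction of such a PHOC vector is sketched below; FIFTY_COMMON_BIGRAMS is a hypothetical list of the 50 most frequent English bigrams, and the exact region-assignment rule may differ from the cited implementation. With unigram levels 2 to 6 over 36 symbols this yields (2+3+4+5+6)x36 = 720 bins, and a level-2 split over 50 bigrams adds 2x50 = 100 bins, giving the 820 dimensions mentioned above.

```python
import numpy as np

UNIGRAMS = "abcdefghijklmnopqrstuvwxyz0123456789"          # 36 symbols

def phoc(word, unigram_levels=(2, 3, 4, 5, 6), bigrams=(), bigram_levels=(2,)):
    """Build a PHOC descriptor for a word string.

    Each pyramid level splits the word into `level` equal parts and builds a
    binary occurrence histogram over the vocabulary for each part.
    """
    word = word.lower()
    n = max(len(word), 1)

    def region_hists(symbols, positions, level, vocab):
        out = []
        for part in range(level):
            lo, hi = part / level, (part + 1) / level
            bins = np.zeros(len(vocab))
            for sym, centre in zip(symbols, positions):
                if lo <= centre < hi and sym in vocab:
                    bins[vocab.index(sym)] = 1.0
            out.append(bins)
        return out

    uni_pos = [(i + 0.5) / n for i in range(len(word))]      # character centres
    bi_syms = [word[i:i + 2] for i in range(len(word) - 1)]  # consecutive bigrams
    bi_pos = [(i + 1.0) / n for i in range(len(word) - 1)]

    vec = []
    for level in unigram_levels:
        vec += region_hists(list(word), uni_pos, level, UNIGRAMS)
    for level in bigram_levels:
        vec += region_hists(bi_syms, bi_pos, level, list(bigrams))
    return np.concatenate(vec)

# phoc("text", bigrams=FIFTY_COMMON_BIGRAMS).shape  ->  (820,)  (hypothetical list)
```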
5.2.4 Summary
From the literature on text spotting in videos, it can be noted that more attention has been put towards
feature and learning based approaches compared to the hybrid or the combination approach. Interest-
ingly, the feature extraction based methods do not require a large number of samples for training and
have fewer tuning parameters compared to the learning based approaches. However, their ability to adapt to different situations and applications comes at the cost of accuracy in contrast to learning based methods. There are hybrid methods that consider the combination of handcrafted features and learning to
achieve better results than individual approaches. However, the main issue with this type of approach
is developing appropriate and effective techniques for feature extraction and learning stages. At the
same time, the performance of the hybrid methods depends on the success of each stage.
5.3 Limitations of the Spotting based Text Mining Methods
We tried to provide a critique of each method during the overview of the literature. Considering the
research work in the literature of text spotting in images and videos, several more limitations can also
be pointed out as follows. Generally, conventional text spotting methods in images and videos highly
depend on binarization, connected component analysis, and segmentation tasks. These tasks are very
sensitive to image/video contrast, complex background, and resolution. Moreover, they are sensitive to
font types and sizes, noise, distortion and degradations. Therefore, these tasks and their pipeline as a
whole, which is called the conventional model, may not provide good keyword spotting in natural scene images and video frames (Shivakumara et al., 2019). Some methods in the literature used a recognition
module for text spotting. Though the recognition module can improve text spotting results, these meth-
ods are highly dependent on training data; in particular, they require large and unconstrained datasets
for training and verification. In addition, some of the methods in the literature are dependent on word
lexicons with fixed vocabulary, which results in the limitation of the methods for unconstrained text
spotting (Mafla et al., 2020). Moreover, feature extraction methods, especially when dealing with a huge
number of video frames, are computationally expensive. This may affect their suitability for real-time
applications that need to be both efficient and accurate. It is also worth mentioning that most of the
descriptors are sensitive to contrast, background variations and degradations which make the feature-
based text spotting method likely unreliable in images and video frames. The general problem with
deep-learning methods is the need for a large number of training samples that may result in the loss of
their generic property (Shivakumara et al., 2019; Mafla et al., 2020). The use of pre-trained models may be suitable at different stages of the text spotting problem.
6. FUTURE DIRECTIONS
As future work in the text spotting research domain, the applications of rule mining can be investigated
in three different contexts, namely document classification and retrieval, automated layout correction,
and automated generation of documents. The application of spatial data mining techniques can also be
investigated to find associations between logical components in document images employing document
analysis and understanding methods. As data annotation in document analysis applications is time-
consuming and expensive and, at the same time, machine learning based approaches for text spotting require a huge amount of data for training and fine-tuning the parameters of the model, another direction
of research in this domain is using semi-supervised methods trained with weakly annotated data.
It is noted from the review on text spotting in natural scene and video images that none of the methods
can handle all the challenges of keyword spotting in natural scene and video images. Most methods
focus either on natural scene images or video images but not both. This is because the nature of scene
images and videos differs in terms of characteristics and complexity. In addition, in the case of video-
based methods, there is no proper criterion to automatically determine the number of temporal frames
according to the complexity of the problem. Most methods aim to find a solution to the problems but a
few methods have addressed the issue of system and prototype design for real-time applications. When
we look at target applications of text detection and spotting in the natural scene and video images,
retrieval and indexing is the main application. There are forensic and healthcare applications, such as
tracing a bomber or an unwell person with the help of bib numbers in a marathon, or person re-identification through text spotting on jerseys in sports, which can also be used for person behavior identification.
Text spotting in these applications is challenging because of the short length of text, occlusion and
movements.
Furthermore, the scope of the existing mining approaches is confined to 2D natural scene, video and
document images but not 3D images. However, due to the availability of 3D cameras, scanners and
future 3D TVs, one can expect 3D images, 3D TV, and 3D movies. In this case, the existing methods may not
be effective or applicable to these types of data. The main reason is that depth data/information creates
shadows and allows decorative characters to be written in the text. The presence of shadows and irregular decorative-shaped characters affects the actual shape of the characters in the text. Therefore, the performance of the existing methods declines in such cases. To overcome this problem, one way is to
classify 2D and 3D text images so that existing 2D methods can be used for 2D images and modified for 3D images.
Another way is to detect shadow and depth information in the 3D images and remove the depth. This
is possible because the pixels representing shadows have lower values compared to the pixels representing text. This results in 2D text images, and hence existing 2D methods can be applied for text
spotting. One more way is to develop a new method that can work for both 2D and 3D images without
classifying the images or without shadow removal. As text is common in both 2D and 3D, it is possible
to define context based on recognition results and natural language processing that can help to find
text in both types of images.
Nowadays, many closed-circuit television (CCTV) cameras are installed in cities, houses, hotels, malls, roads, and streets to identify crimes and to use the data as cues and evidence. When the same spot is captured
by multiple CCTV cameras, the same text in the views appears in different forms due to variation in
distance, angle, height from the ground and configuration. In addition, each view may suffer from dif-
ferent adverse effects, such as low resolution, contrast, missing information, and perspective distortion.
As a result, the existing methods may not work well. This is a new direction of research in this domain
to investigate and propose new ideas that can use information in different views to predict the correct
text information.
Another new trend and research topic is text spotting in underwater images and videos. The complexity
of the problem depends on water depth and water clarity. As clarity decreases, the complexity of spotting text increases. Spotting text in underwater images is not easy because of the poor quality,
degradation and occlusion of the text. Therefore, the existing methods may not work well for underwater
images. In this case, since the properties of text and water information are different, the method should
explore these cues to enhance the fine details in the text. Then we can use text detection methods for
extraction. Another new application related to forensic, crime and person behavior identification is tattoo
text spotting in the images. Spotting text in those images can help us to study person psychology,
behavior, personality traits, person identification, and gang identification. However, detecting tattoo text
is challenging compared to text in natural scene, video and document images. Tattoo text is handwritten text with a decorative style, embedded on the skin of different parts of the human body. To find a
solution to this problem, one can think of detecting skin to reduce the complexity of the problem. There
are many methods available for skin detection. Skin detection results can be considered as context to
detect tattoo text in the images. To fix the exact bounding box for each tattoo text line, we need to
explore natural language processing and recognition results because tattoo text lines are connected to
the decorative background and other tattoo text lines. It is also possible to integrate text, image, video
and audio information for text mining from sports-related datasets. This can be considered as another
direction for mining text from sports datasets to understand and analyse different sports or games.
7. CONCLUSIONS
In this research work, we have provided a comprehensive review of the recent advances in the literature
of mining text from the natural scene, video and document images. We identified the objective, signifi-
cance, and scope of different methods. Furthermore, we presented datasets, evaluation schemes, and
measures used in the literature of text spotting in images and videos. With the analysis of methods in
the literature, it has been learnt that for mining meaningful information from the video, scene and doc-
ument images, two typical methods known as non-spotting text detection models and keyword spotting
models can be employed on images and videos. It has further been found that most data mining meth-
ods focus on the content of the images but not the text in the images/videos. Considering technological
aspects and the use of different attributes and components in each method, the survey has revealed
that the models in the literature can be categorized and further analyzed based on the types of their
features, and learning strategies. This categorization has highlighted the usefulness, effectiveness and
limitations of each category and model with respect to different applications. The analysis of the meth-
ods in each category has further revealed that feature-based models for both non-spotting and spotting
are good in terms of flexibility, adaptability, and generalization concerning different situations and ap-
plications. In the case of learning based models, we noted that the success of these methods highly
depends on fine-tuning various parameters and the number of training samples. In contrast, the survey
revealed that since hybrid models consider the advantages of both feature engineering and learning
based models, the hybrid models perform better than feature and learning based models in complex
situations.
The survey has further revealed that there are several potential applications in the field of text mining,
namely, text mining in 3D videos, sports event mining based on multiple views captured by different
CCTV cameras, and person re-identification from the data captured by multiple cameras. The new ap-
plications pose several challenges and open problems for researchers in the field of text mining. Con-
sidering the limitations observed in current text detection methods on video and natural scene images
as well as new applications and their associated challenges, researchers can find several research
opportunities to investigate and explore new text mining models and solutions for those open challeng-
ing problems.
References
Alaei, A., Conte, D., & Raveaux, R. (2015). Document image quality assessment based on improved
gradient magnitude similarity deviation. In 2016 12th IAPR workshop on document analysis sys-
tems (DAS), pp. 176-180.
Alaei, A., Raveaux, R., & Conte, D. (2017). Image quality assessment based on regions of interest.
SIViP, 11, 673680.
Almazan, J., Gordo, A., Fornés, A., & Valveny, E. (2014). Word spotting and recognition with embed-
ded attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36, 25522566.
Al-Rawi, M., Valveny, E., & Karatzas, D. (2019). Can one deep learning model learn script-independent
multilingual word-spotting? In 2019 international conference on document analysis and recognition
(ICDAR), pp. 260267.
Arafat, S. Y., & Iqbal, M. J. (2020). Urdu-text detection and recognition in natural scene images in deep
learning. In Proceedings on ELMAR, pp. 96787-96803.
Bagi, R., Dutta, T., & Gupta, H. P. (2020a). Cluttered TextSpotter: An end-to-end trainable light-weight
scene text spotter for cluttered environment. IEEE Access, 8, 111433111447.
Bagi, R., & Dutta, T. (2020b). Cost-effective and smart text sensing & spotting in blurry scene images
using deep networks. IEEE Sensors Journal, 11. https://doi.org/10.1109/JSEN.2020.3024257
Bazazian, B., Karatzas, D., & Bagdanov, A. D. (2018a). Word spotting in scene images based on char-
acter recognition. In 2018 IEEE/CVF conference on computer vision and pattern recognition work-
shops (CVPRW), pp. 195319532.
Bazazian, D., Karatzas, D., & Bagdanov, A. D. (2018b). Soft-PHOC descriptor for end-to-end word
spotting in egocentric scene images. In The third international workshop on egocentric perception,
interaction and computing (EPIC) at ECCV2018, pp. 1-9.
Benabdelaziz, R., Gaceb, D., & Haddad, M. (2020). Word-spotting approach using transfer deep learn-
ing of a CNN network. In 2020 1st international conference on communications, control systems
and signal processing (CCSSP), pp. 219-224.
Bonechi, S., Bianchini, M., Scarselli, F., & Andreini, P. (2020). Weak supervision for generating pixel-
level annotations in scene text segmentation. Pattern Recognition Letters, 138, 17.
Brisinello, M., Grabic, R., Vranjes, M., & Vranjes, D. (2019). Review on text detection methods on scene images. In 2019 international symposium ELMAR.
Cai, Y., Wang, W., Chen, Y., & Ye, Q. (2020). IOS-Net: An inside-to-outside supervision network for
scale robust text detection in the wild. Pattern Recognition, 103, 107304.
Chandio, A. A., & Pickering, M. (2019). Convolutional feature fusion for multi-language text detection in
natural scene images. In 2019 2nd international conference on computing, mathematics and engi-
neering technologies (iCoMET).
Cheikhrouhou, A., Kessentitini, Y., & Kanoun, S. (2021). Multi-task learning for simultaneous script
identification and keyword spotting in document images. Pattern Recognition, 113.
Dadiya, N. J., & Goswami, M. M. (2019). Multiscript text detection from images: A survey. In 2019
innovations in power and advanced computing technologies (i-PACT), pp. 1-5.
Dai, P., Zhang, H., & Cao, X. (2020). Deep multi-scale context aware feature aggregation for curved
scene text detection. IEEE Transactions on Multimedia, 99, 19691984.
Dutta, K., Mathew, M., Krishnan, P., & Jawahar, C. V. (2018). Localizing and recognizing text in lecture
videos. In 2018 16th international conference on frontiers in handwriting recognition (ICFHR), pp.
235240.
Fan, J., Chen, T., & Zhou, F. (2020). BURSTS: A bottom-up approach for robust spotting of texts in
scenes. Journal of Visual Communication and Image Representation, 71.
Fassold, H., & Ghermi, R. (2019). OmniTrack: Real time detection and tracking of objects, text and
logos in video. In Proceedings on ISM, pp. 245-246.
Francis, L. M., & Sreenath, N. (2020). TEDLESS: Text detection using least-square SVM for natural
scene. Journal of King Saud University-Computer and information sciences, 32(3), 87299.
Gao, Y., Huang, Z., Dai, Y., Chen, K., Guo, J., & Qiu, W. (2019). Wacnet: Word segmentation guided
characters aggregation net for scene text spotting with arbitrary shapes. In Proceedings on ICIP,
pp. 33823386.
Gomez, L., & Karatzas, D. (2017). TextProposals: A text-specific selective search algorithm for word
spotting in the wild. Pattern Recognition, 70, 6074.
Guo, J., You, R., & Hung, L. (2020). Mixed vertical and horizontal text traffic sign detection and recog-
nition for street level scene. IEEE Access, 8, 6941369425.
Gupta, A., Vedaldi, A., & Zisserman, A. (2016). Synthetic data for text localization in natural images. In
Proceedings on ECCV, pp. 23152324.
Huang, R., & Xu, B. (2019). Text attention and focal negative loss for scene text detection. In Proceed-
ings on IJCNN, pp. 18.
Hui, L., Peng, W., & Shen, C. (2017). Towards end-to-end text spotting with convolutional recurrent
neural networks. In Proceedings on ICCV, pp. 52485256.
Jaderberg, M., Vedaldi, A., & Zisserman, A. (2014). Deep features for text spotting. In D. Fleet, T.
Pajdla, B. Schiele, & T. Tuytelaars (Eds.), Computer vision ECCV 2014 (Vol. 8692, p. 2014).
LNCS, Springer.
Jaderberg, M., Simonyan, K., Vedaldi, A., & Zisserman, A. (2016). Reading text in the wild with convo-
lutional neural networks. International Journal of Computer Vision, 116, 120.
Jha, A., Namboodiri, V. P., & Jawahar, C. V. (2018). Word spotting in silent lip videos. In Proceedings
on WACV, pp. 150159.
Judd, T., Ehinger, K., Durand, F., & Torralba, A. (2009). Learning to predict where humans look. In
Proceedings on ICCV, pp. 21062113.
Jung, H., & Lee, B. G. (2020). Research trends in text mining: Semantic network and main path analysis
of selected journals. Expert Systems with Applications, 162, 113851.
Khalil, A., Jarrath, M., AI-Ayyoub, M., & Jaraweh, Y. (2021). Text detection and script identification in
natural scene images using deep learning. Computers & Electrical Engineering, 91, 107043.
Khan, M. J., Said, N., Khan, A., Rehman, N., & Khurshid, K. (2019). Automated Latin text detection in
document images and natural scene images based on connected component analysis. In Proceed-
ings on iCoMET.
Kota, B. U., Davila, K., Stone, A., Setlur, S., & Govindaraju, V. (2018). Automated detection of hand-
written whiteboard content in lecture videos for summarization. In Proceedings on ICFHR, pp. 19
24.
Kumar, D. & Singh, R. (2019). A comparative analysis of features extraction algorithms and deep learn-
ing techniques for detection from natural images. In Proceedings on ISCON, pp. 483487.
Lee, C. H., & Wang, S. H. (2012). An information fusion approach to integrate image annotation and
text mining methods for geographic knowledge discovery. Expert Systems with Applications,
39(10), 89548967.
Li, Z., Liu, J., Zhang, G., Huang, Y., Zheng, Y., & Zhang, S. (2021). Learning to predict more accurate
text instances for scene text detection. Neurocomputing, 449, 455463.
Liao, M., Lyu, P., He, M., Yao, C., Wu, W., & Bai, X. (2021). Mask TextSpotter: An end-to-end trainable
neural network for spotting text with arbitrary shapes. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 43(2), 532548.
Liu, J., Chen, Z., Du, B., & Tao, D. (2020). ASTS: A unified framework for arbitrary shape text spotting.
IEEE Transactions on Image Processing, 29, 59245936.
Liu, J., Zhong, Q., Yuan, Y., Su, H., & Du, B. (2020). SemiText: Scene text detection with semi-super-
vised learning. Neurocomputing, 407, 343353.
Liu, S., Xian, Y., Li, H., & Yu, Z. (2020). Detection in natural scene images using morphological com-
ponent analysis and Laplacian dictionary. IEEE Journal of Automatic Sinica, 7(1), 214222.
Liu, X., Liang, D., Yan, S., Chen, D., Qiao, Y., & Yan, J. (2018). FOTS: Fast oriented text spotting with
a unified network. In Proceedings on CVPR, pp. 56765685.
Liu, Y., Chen, H., Shen, C., He, T., Jin, L., & Wang, L. (2020). ABCNet: Real-time scene text spotting
with adaptive Bezier-curve network. In Proceedings on CVPR, pp. 98069815.
Liu, Y., Jin, L., & Fang, C. (2020). Arbitrarily shaped scene text detection with a mask tightness text
detector. IEEE Transactions on Image Processing, 29, 29182930.
Ma, X., Sun, L., Zhong, Z., & Huo, Q. (2021). ReLaText: Exploiting visual relationships for arbitrarily
shaped scene text detection with graph convolutional networks. Pattern Recognition, 111, 107684.
Mafla, A., Tito, R., Dey, S., Gomez, L., Rusiñol, M., Valveny, E., & Karatzas, D. (2020). Real-time
lexicon-free scene text retrieval. Pattern Recognition, 110, 107656.
Mokayed, H., Shivakumara, P., Woon, H. H., Kankanhalli, M., Tong, L., & Pal, U. (2021). A new DCT-
OCM method for license plate number detection in drone images. Pattern Recognition Letters, 148,
4553.
Nag, S., Ramachandra, R., Shivakumara, P., Pal, U., Lu, T., & Kankanhalli, M. (2019). CRNN based
Jersey-bib number/text recognition in sports and marathon images. In Proceedings on ICDAR, pp.
11491156.
Nag, S., Shivakumara, P., Pal, U., Lu, T., & Blumenstein, M. (2020). A new unified method for detecting
text from marathon runners and sports players in video. Pattern Recognition, 107, 107476.
Panwar, M. A., Memon, K. A., Abro, A., Zhongliang, D., Khuhro, S. A., & Memon, S. (2020). Signboard
detection and recognition using artificial neural networks. In Proceedings on ICEIEC, pp. 1619.
Pooja, A. and Dhir, R. (2016). Video text extraction and recognition: A survey. In Proceedings on WiSP-
NET, pp. 1366-1373.
Putro, R. A. P., Putri, F. P., & Praseriyowati, M. I. (2019). A combined edge detection analysis and
clustering based approach for real time text detection. In Proceedings on ICNMS, pp. 59-62.
Qin, S., Bissacco, A., Raptis, M., Fujii, Y., & Xiao, Y. (2019). Towards unconstrained end-to-end text
spotting. In Proceedings on ICCV, pp. 47044714.
Qin, X., Zhou, Y., Yangn, D., & Wang, W. (2019). Curved text detection in natural scene images with
semi and weakly supervised learning. In Proceedings on ICDAR, pp. 559564.
Raghunandan, K. S., Shivakumara, P., Roy, S., Kumar, G. H., Pal, U., & Lu, T. (2019). Multi-script
oriented text detection and recognition in video/scene/born digital images. In IEEE transactions on
circuits and systems for video technology, pp. 11451161.
Rasheed, J., Jamil, A., Dogru, H. B., Tilki, S., & Yesiltepe, M. (2019). A deep learning based method
for Turkish text detection from videos. In Proceedings on ELECO, pp. 935939.
Reddy, S., Mathew, M., Gomez, L., Rusinol, M., Kartazas, D. & Jawahar, C. V. (2020). RoadText-1K:
Text detection & recognition dataset for driving videos. In Proceedings on ICRA, pp. 1107411080.
Rong, X., Yi, C., & Tian, Y. (2020). Unambiguous scene text segmentation with referring expression
comprehension. IEEE Transactions on Image Processing, 29, 591601.
Roy, P. P., Bhunia, A. K., & Pal, U. (2018). Date-field retrieval in scene image and video frames using
text enhancement and shape coding. Neurocomputing, 274, 3749.
Roy, P. P., Bhunia, A. K., Bhattacharyya, A., & Pal, U. (2019). Word searching in scene image and
video frame in multi-script scenario using dynamic shape coding. Multimedia Tools and Applica-
tions, 78, 77677801.
Roy, P. P., Das, A., Majhi, D., & Pal, U. (2015). Retrieval of scene image and video frames using date
field spotting. In Proceedings on ACPR, pp. 705-709.
Roy, S., Shivakumara, P., Pal, U., Lu, T., & Kumar, G. H. (2020). Delaunay triangulation based text
detection from multi-view images of natural scene. Pattern Recognition Letters, 129, 92100.
Sabir, A., Moreno-Noguer, F., & Padro, L. (2020). Textual visual semantic dataset for text spotting. In
Proceedings on CVPRW, pp. 23062315.
Saha, S., Chakraborty, N., Kundu, S., Paul, S., Mollah, A. F., Basu, S., & Sarkar, R. (2020). Multi-lingual
scene text detection and languageidentification. Pattern Recognition Letters, 138, 1622.
Sexton, T., Hodkiewicz, M., Brundage, M. P., & Smoker, T. (2018). Benchmarking for keyword extrac-
tion methodologies in maintenance work orders. Annual Conference of the PHM Society, 10(1), 1
10.
Sharma, N., Pal, U., & Blumenstein, M. (2012). Recent advances in video based documents processing:
A review. In Proceedings on DAS, pp. 6368.
Shi, B., Bai, X., & Yao, C. (2017). An end-to-end trainable neural network for image-based sequence
recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 39(11), 22982304.
Shi, C., Wang, C., Xiao, B., Gao, S., & Hu, J. (2014). Scene text recognition using structure-guided
character detection and linguistic knowledge. IEEE Transactions Circuits and Systems for Video
Technology, 24(7), 12351250.
Shivakumara, P., Liang, F., Roy, S., Pal, U., & Lu, T. (2015). New texture-spatial features for keyword
spotting in video images. In Proceedings on ACPR, pp. 391-395.
Shivakumara, P., Roy, S., Jalab, H. A., Ibrahim, R. W., Pal, U., Lu, T., Khare, V., & Wahab, A. B. A.
(2018). Fractional means based method for multi-oriented keyword spotting. Expert Systems with
Applications, 118, 119.
Song, H., Wang, H., Huang, S., Xu, P., Huang, X., & Ju, Q. (2019). Text Siamese network for video
textual keyframe detection. In Proceedings on ICDAR, pp. 442447.
Song, Q., Zhang, R., Zhou, Y., Jiang, Q., Liu, X., Wang, H., & Wang, D. (2019) Reading Chinese scene
text with arbitrary arrangement based on character spotting. In Proceedings on ICDARW, pp. 91
96.
Song, Z., Zhang, H., & Cui, P. (2019). Towards end-to-end scene text spotting by sharing convolutional
feature map. In Proceedings on ICCC, pp. 18141820.
Tarafdar, A., Mandal, R., Pal, S., Pal, U., & Kimura, F. (2010). Shape code based word-image matching
for retrieval of Indian multi-lingual documents. In Proceedings on ICPR, pp. 9891992.
Tarafdar, A., Pal, U., Roy, P. P., Ragot, N., & Ramel, J. Y. (2013). A two-stage approach for word
spotting in graphical documents. In Proceedings on ICDAR, pp. 319323.
Tian, Z., Huang, W., He, T., He, P., & Qiao, Y. (2016). Detecting text in natural image with connectionist
text proposal network. In Proceedings on ECCV, pp. 5672.
Tuna, T., Subhlok, J., & Shah, S. (2011). Indexing and keyword search to ease navigation in lecture
videos. In Proceedings on AIPR.
Tursun, O., Denman, S., Zeng, R., Sivapaplan, S., Sridharan, S., & Fookes, C. (2021). MTRNet++: One
stage mask based scene text eraser. Computer Vision and Image Understanding, 201, 103066.
Wang, F., Ngo, C. W., & Pong, T. C. (2003). Synchronization of lecture videos and electronic slides by
video text analysis. In Proceedings on ACM MM.
Wang, Q., Zheng, Y., & Betke, M. (2020). A method for detecting text of arbitrary shapes in natural
scenes that improves text spotting. In Proceedings on CVPRW, pp. 22962305.
Wang, S., Liu, Y., He, Z., Wang, Y., & Tang, Z. (2020). A quadrilateral scene text detector with two-
stage network architecture. Pattern Recognition, 102, 107230.
Wang, Y., Wang, L., Su, F., & Shi, J. (2019). Video text detection with fully convolutional network and
tracking. In Proceedings on ICME, pp. 17381743.
Wu, D., Wang, R., Tian, X., Liang, D., & Cao, X. (2018). The keywords spotting with context for multi-
oriented Chinese scene text. In Proceedings on BigMM, pp. 15.
Xiao, X., Xu, Y., Zhang, C., Li, X., Zhang, B., & Bian, Z. (2019). A new method for pornographic video
detection with the integration of visual and textual information. In Proceedings on IMCEC, pp. 1600
1604.
Xiao, Y., Xue, M., Lu, T., Wu, Y., & Palaiahnakote, S. (2019). A text context aware CNN network for
multi-oriented and multi-language scene text detection. In Proceedings on ICDAR, pp. 695700.
Xu, Y., Wang, Y., Zhou, W., Wang, Y., Yang, Z., & Bai, X. (2019). TextField: Learning and deep direction
field for irregular scene text detection. IEEE Transactions on Image Processing, 28(11), 5566
5578.
Xue, M., Shivakumara, P., Zhang, C., Xiao, Y., Lu, T., Pal, U., & Lopresti, D. (2020). Arbitrarily-oriented
text detection in low light natural scene images. IEEE Transactions on Multimedia.
https://doi.org/10.1109/TMM.2020.3015037
Yan, H., & Xu, X. (2020). End to end video subtitle recognition via a deep residual neural network.
Pattern Recognition Letters, 131, 368375.
Yang, H., & Meinel, C. (2014). Content based lecture video retrieval using speech and video text infor-
mation. IEEE Transactions on Learning Technologies, 7(2), 142154.
Ye, Q., & Doermann, D. (2015). Text detection and recognition in imagery: A survey. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 37, 14801500.
Yin, X. C., Zuo, Z. Y., Tian, S., & Liu, C. L. (2016). Text detection, tracking and recognition in video: A
comprehensive survey. IEEE Transactions on Image Processing, 25(6), 27522773.
Youngjiu, L., Chunang, L., Minyong, S., & Changxing, S. (2019). Video subtitle location and recognition
based on edge features. In Proceedings on DSA, pp. 455459.
Yu, H., Zhang, C., Li, X., Han, J., Ding, E., & Wang, L. (2019). An end to end video text detector with
online tracking. In Proceedings on ICDAR, pp. 601606.
Zamberletti, A., Gallo, I., & Noce, L. (2015). Augmented text character proposals and convolutional
neural networks for text spotting from scene images. In Proceedings on ACPR, pp. 196200.
Zdenek, J. & Nakayama, H. (2020). Erasing scene text with weak supervision. In Proceedings on
WACV.
Zhang, K., Chen, K., & Fan, B. (2021). Massive picture retrieval system based on big data image mining.
Future Generation Computer Systems, 121, 5458.
Zheng, Y., Xie, Y., Zu, Y., Yang, X., Li, C., & Zhang, Y. (2020). Scale robust deep oriented text detection
network. Pattern Recognition, 102, 107180.
Zhou, T., Wang, K., Wu, J., & Li, R. (2019). Video text processing method based on image stitching. In
Proceedings on ICIVC, pp. 561566.
Zhou, Y., Fang, S., Xie, H., Zha, Z., & Zhang, Y. (2019). MLTS: A multi-language scene text spotter. In
Proceedings on ICME, pp. 163168.
Zhu, Y., & Du, J. (2020). TextMountain: Accurate scene text detection via instance segmentation. Pat-
tern Recognition, 110, 107336.
... Video mining has three main tasks: pre-processing, features and semantic information extraction, video patterns and knowledge discovering and forming. Video mining has different applications and usages such as traffic video sequences, medicine, surveillance system and security programs [6], [7]. Video tracking is the process of utilizing a camera to determine the location of an item that changes its position over time. ...
... The first level is background subtraction. It is a fast method to find which pixels have changed Chapter One General Introduction ‫ـــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــــ‬ ‫ــــــــــــــــــــ‬ ‫ـــــــــــــــــــــــ‬ 7 in the image (the foreground). The second level is multi-objects tracking methods such as kernelized correlation filters (KCF). ...
Thesis
Full-text available
Computer vision is one of the important scientific fields in the modern era because the video elements contain rich and important information. Hence, knowledge and data van be ontained to refer to a huge amount of useful information. The process of distinguishing and separating only the discovered information is one of the complex and well-known problems. The problem of classification and clustering of moving objects in video data is also a complex task that requires mechanisms, operations, as well as algorithms for the purpose of solving it and obtaining distinct results as possible. In this dissertation, a system is proposed for the purpose of clustering moving objects based on their behavior using a graph mining algorithm. A new algorithm is proposed for the purpose of mining the large data that are represented using a graph. Moreover, another algorithm is proposed for the purpose of data reduction and extracting the important data only. Some of the algorithms used in the proposed system have also been adapted in order to increase their performance. The proposed system firstly splits the video input into sequences frames. The second phase is to apply some preprocessing operations to enhance the quality of frame (still image). The third phase is to apply You Only Lock Once (YOLO) multiple objects detection and Simple Online and Real Time Tracking with a Deep Association Metric (Deep-SORT) tracking objects to discover and track objects with different classes. The fourth phase is to build trajectory for each object and apply a new proposed shape normalization algorithm. The fifth phase is to extract features for trajectories and construct graph for them. The graph data are stored in graph database. The sixth phase is to apply a new iv suggested graph mining algorithm to mine the interested data. Finally, fuzzy c-means is applied to cluster data into a different number of groups. The experimental results suggest that the proposed system is robust with high performance. Algorithms used for detection and tracking outperformed the findings of other detecting and tracking algorithms as they achieved a high accuracy. Moreover, the proposed normalization algorithm shows that about 50% of unrich points are discarded. Furthermore, the graph mining proposed algorithm showed high performance to extract interested data. In addition, the proposed algorithm for graph mining showed a high performance of more than 95% for extracting important data.
... A word cloud is a visual representation of the words associated with text data. It is also used to highlight the words according to their frequency and relevance (Shivakumara et al., 2021). The text analysis follows the specific steps mentioned below. ...
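As a rough illustration of the word-cloud step described above, the following Python sketch counts word frequencies with collections.Counter; the sample text and stopword list are hypothetical, and the commented lines assume the third-party wordcloud package if an actual image is wanted.

from collections import Counter
import re

def word_frequencies(text, stopwords=frozenset({"the", "and", "of", "in", "to", "a"})):
    """Count word occurrences after lowercasing and dropping common stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

abstracts = "Relic tourism and heritage protection and development of relic sites"  # hypothetical corpus
freq = word_frequencies(abstracts)
print(freq.most_common(10))

# Rendering as an image, assuming the third-party `wordcloud` package is installed:
# from wordcloud import WordCloud
# WordCloud(width=800, height=400).generate_from_frequencies(freq).to_file("cloud.png")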
Article
Purpose: The objective of the paper is to find trends of research in relic tourism-related topics. Specifically, this paper uncovers all published studies having latent issues with the keywords “relic tourism” from the Web of Science database. Methods: A total of 109 published articles (2002-2021) related to “relic tourism” were collected. Machine learning tools were applied. Network analysis was used to highlight top researchers in this field, their citations, keyword clusters, and collaborative networks. Text analysis and the Bidirectional Encoder Representations from Transformers (BERT) artificial intelligence model were used to predict text- or keyword-based topic references in machine learning. Results: All the papers are basically published around three primary keywords, namely “relics,” “culture,” and “heritage.” Secondary keywords like “protection” and “development” also attract researchers to this topic. The co-author network is highly significant for diverse authors, and geographically, researchers from five countries are collaborating the most on this topic. Implications: Academically, future research can be predicted with dense keywords. Journals can bring out more special issues related to the topic, as relic tourism still has some unexplored areas. Keywords: Text analysis, machine learning, artificial intelligence, topic modeling, relic tourism.
... In this work, we apply text analytics to generate a word cloud and plot the most frequent terms used in the abstract and title [73]. A word cloud is a visual representation of the words associated with text data, highlighting the words' frequency and relevance [74]. The text analysis follows the specific steps mentioned below. ...
Article
Blockchain and immersive technology are pioneers in bringing digitalization to tourism, and researchers worldwide are exploring many facets of these techniques. This paper analyzes the various aspects of blockchain technology and its potential use in tourism. We explore high-frequency keywords, perform network analysis of relevant publications to analyze patterns, and introduce machine learning techniques to facilitate systematic reviews. We focused on 94 publications from the Web of Science that dealt with blockchain implementation in tourism from 2017 to 2022. We used VOSviewer for network analysis and artificial intelligence models, with the help of machine learning tools, to predict the relevance of the work. Many reviewed articles mainly deal with blockchain in tourism and related terms such as smart tourism and crypto tourism. This study is the first attempt to use text analysis to improve the topic modeling of blockchain in tourism. It comprehensively analyzes the technology’s potential use in the hospitality, accommodation, and booking industry. In this context, the paper provides significant value to researchers by giving an insight into trends and keyword patterns. Tourism still has many unexplored areas; journal articles should also feature special studies on this topic.
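A keyword co-occurrence count is the core ingredient of the kind of keyword network analysis mentioned above; the sketch below builds such counts from hypothetical per-paper keyword lists (the papers and keywords are made up, and this simple counting differs from the VOSviewer tooling used in the study).

from collections import Counter
from itertools import combinations

# Hypothetical author-keyword lists for a handful of publications.
papers = [
    ["blockchain", "tourism", "smart tourism"],
    ["blockchain", "crypto tourism", "booking"],
    ["tourism", "smart tourism", "booking"],
]

cooccurrence = Counter()
for keywords in papers:
    for a, b in combinations(sorted(set(keywords)), 2):
        cooccurrence[(a, b)] += 1     # each pair appearing in the same paper is one link

# Edges with the highest counts form the densest part of the keyword network.
print(cooccurrence.most_common(5))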
... These classifiers are later used to facilitate multiple forms of principal textual analysis and mining problems, such as text retrieval, topic identification/modelling, sentiment/opinion analysis, fake news/spam detection, etc. In recent times, the rapid development of the Internet has led to the dramatic emergence of huge textual resources from the Internet's large-scale online platforms, such as social networks (Facebook, Twitter, etc.) (Zucco et al. 2020), social media platforms (YouTube, Instagram, etc.) (Shivakumara et al. 2021), and open encyclopedias/knowledge graphs (Wikipedia, YAGO, etc.). In fact, there is a greater need for developing more powerful tools/systems which can effectively support extracting valuable knowledge from these huge textual resources. ...
Article
Recently, with the rapid development of the Internet and social networks, there has been a tremendous increase in the amount of complex-structured text resources. These information explosions require extensive studies as well as more advanced methods in order to better understand and effectively model/learn these high-dimensional, structurally complicated textual datasets. Moving along with the recent progress in deep learning and textual representation learning approaches, many researchers in this domain have been attracted to utilizing different deep neural architectures for learning essential features from texts. These novel neural architectures must be able to handle complex textual feature engineering. Moreover, they also have to be able to extract deeper semantic and structural information from textual resources. Recently, several integrations of advanced deep learning architectures, such as recurrent neural networks (RNNs), sequence-to-sequence (seq2seq) models and transformers, have been proposed for text classification. These hybrid deep neural architectures have shed light on how computers can comprehensively process sequential information from texts and be fine-tuned to leverage the performance of multiple tasks in natural language processing, including classification. However, most recent RNN-based techniques still suffer from several limitations. These limitations are mainly related to the capability of capturing the global long-range dependencies as well as the syntactical structures of the given text corpus. Some recent studies have shown that a combination of graph-based text representation and graph neural network (GNN) approaches can cope with these challenges. In this survey, we mainly focus on discussing recent state-of-the-art studies dedicated to text graph representation learning through GNNs, referred to as TG-GNN. In addition, besides discussing the features and capabilities of TG-GNN-based models, we also mention their pros and cons. Extensive comparative studies of TG-GNN-based techniques on benchmark datasets for the text classification problem are also provided in this survey. Finally, we highlight existing challenges as well as identify perspectives which might be useful for future improvements in this research direction.
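For readers unfamiliar with GNN-based text models, a single graph-convolution layer of the form H' = ReLU(D^-1/2 (A + I) D^-1/2 H W) can be sketched in a few lines of numpy; the toy adjacency matrix, features and weights below are hypothetical and only illustrate the propagation rule, not any specific TG-GNN model.

import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])                       # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)

# Toy text graph: 4 word/document nodes with 8-dimensional input features.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
H = np.random.randn(4, 8)
W = np.random.randn(8, 3)                                # 3 output classes/features
print(gcn_layer(A, H, W).shape)                          # (4, 3)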
Chapter
Computer vision aims to build autonomous systems that can perform some of the human visual system’s tasks (and even surpass it in many cases). Among the several applications of computer vision, extracting information from natural scene images is prominent and influential. The information gained from an image can range from identification and space measurements for navigation to augmented reality applications. These scene images contain relevant text elements as well as many non-text elements. Prior to extracting meaningful information from the text, the foremost task is to classify the text and non-text elements correctly in the given images. The present paper aims to build machine learning models for accurately classifying the text and non-text elements in the benchmark ICDAR 2013 dataset. The result is obtained in terms of the confusion matrix to determine the overall accuracy of the different machine learning models. Keywords: Natural scene images, machine learning models, text and non-text components, classifiers.
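A minimal sketch of this kind of text/non-text classification experiment, using scikit-learn's RandomForestClassifier and confusion_matrix on synthetic stand-in features; in practice, the features would be extracted from connected components of ICDAR 2013 images rather than generated randomly as here.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in features: each row describes one connected component
# (e.g., simple shape/contrast statistics); label 1 = text, 0 = non-text.
rng = np.random.default_rng(0)
X = rng.random((500, 6))
y = (X[:, 0] + 0.5 * X[:, 3] > 0.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)

print("confusion matrix:\n", confusion_matrix(y_te, pred))
print("overall accuracy:", accuracy_score(y_te, pred))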
Article
Maintenance has largely remained a human-knowledge centered activity, with the primary records of activity being text-based maintenance work orders (MWOs). However, the bulk of maintenance research does not currently attempt to quantify human knowledge, though this knowledge can be rich with useful contextual and system-level information. The underlying quality of data in MWOs often suffers from misspellings, domain-specific (or even workforce-specific) jargon, and abbreviations that prevent its immediate use in computer analyses. Therefore, approaches to making this data computable must translate unstructured text into a formal schema or system; i.e., perform a mapping from informal technical language to some computable format. Keyword spotting (or extraction) has proven a valuable tool in reducing manual efforts while structuring data, by providing a systematic methodology to create computable knowledge. This technique searches for known vocabulary in a corpus and maps it to designed higher-level concepts, shifting the primary effort away from structuring the MWOs themselves toward creating a dictionary of domain-specific terms and the knowledge that they represent. The presented work compares rules-based keyword extraction to data-driven tagging assistance through quantitative and qualitative discussion of the key advantages and disadvantages. This will enable maintenance practitioners to select an appropriate approach to information encoding that provides the needed functionality at minimal cost and effort.
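A rules-based keyword-spotting step of the kind compared in this work can be sketched as a simple dictionary lookup; the vocabulary-to-concept mapping and the sample work order below are hypothetical.

import re

# Hypothetical domain dictionary mapping surface forms (including jargon and
# abbreviations) to higher-level maintenance concepts.
CONCEPTS = {
    "hyd": "hydraulic_system", "hydraulic": "hydraulic_system",
    "leak": "fluid_leak", "leaking": "fluid_leak",
    "brg": "bearing", "bearing": "bearing",
}

def tag_work_order(text):
    """Rules-based keyword spotting: map known vocabulary to concepts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sorted({CONCEPTS[t] for t in tokens if t in CONCEPTS})

print(tag_work_order("Hyd pump leaking near front brg"))
# ['bearing', 'fluid_leak', 'hydraulic_system']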
Article
License plate number detection in drone images is a complex problem because the images are generally captured at oblique angles and pose several challenges like perspective distortion, non-uniform illumination effects, degradations, blur, occlusion, loss of visibility, etc. Unlike most existing methods, which focus on images captured from an orthogonal (head-on) direction, the proposed work focuses on drone text images. Inspired by the Phase Congruency Model (PCM), which is invariant to non-uniform illumination, contrast variations, geometric transformations and, to some extent, distortion, we explore the combination of DCT and PCM (DCT-PCM) for detecting license plate number text in drone images. Motivated by the strong discriminative power of deep learning models, the proposed method exploits fully connected neural networks for eliminating false positives to achieve better detection results. Furthermore, the proposed work constructs a working model that fits a real environment. To evaluate the proposed method, we use our own dataset captured by drones and a benchmark license plate dataset, namely Medialab, for experimentation. We also demonstrate the effectiveness of the proposed method on benchmark natural scene text detection datasets, namely SVT, MSRA-TD-500, ICDAR 2017 MLT and Total-Text.
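As a loose illustration of how DCT responses can flag text-like regions (this is not the DCT-PCM method itself), the sketch below computes the share of non-DC DCT energy in an image block with scipy.fftpack; stroke-heavy blocks tend to score higher than smooth background. The block contents are hypothetical.

import numpy as np
from scipy.fftpack import dct

def dct2(block):
    """2-D type-II DCT of an image block."""
    return dct(dct(block, axis=0, norm="ortho"), axis=1, norm="ortho")

def high_frequency_ratio(block):
    """Share of DCT energy outside the DC coefficient; text-like blocks with
    strong strokes/edges tend to score higher than smooth background blocks."""
    coeffs = dct2(block.astype(np.float64))
    total = np.sum(coeffs ** 2) + 1e-12
    return 1.0 - (coeffs[0, 0] ** 2) / total

smooth = np.full((16, 16), 120.0)                        # flat background block
stripes = np.tile([0.0, 255.0], (16, 8))                 # stroke-like pattern
print(high_frequency_ratio(smooth), high_frequency_ratio(stripes))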
Article
At present, multi-oriented text detection methods based on deep neural networks have achieved promising performance on various benchmarks. Nevertheless, there are still some difficulties in arbitrary shape text detection, especially for a simple and proper representation of arbitrary shape text instances. In this paper, a pixel-based text detector is proposed to facilitate the representation and prediction of text instances with arbitrary shapes in a simple manner. Firstly, to alleviate the influence of the target vertex sorting and achieve the direct regression of arbitrary shape text instances, a starting-point-independent coordinate regression loss is proposed. Furthermore, to predict more accurate text instances, a text instance accuracy loss is proposed as an auxiliary task to refine the predicted coordinates under the guidance of IoU. To evaluate the effectiveness of our detector, extensive experiments have been carried out on public benchmarks which contain arbitrary shape text instances and multi-oriented text instances. We obtain an F-measure of 84.8% on the Total-Text benchmark. The results show that our method can reach state-of-the-art performance.
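One way to read a starting-point-independent regression loss is as a distance minimized over all cyclic shifts of the target vertex order; the numpy sketch below implements that interpretation. It is an illustrative simplification, not the paper's exact loss, and the polygons are hypothetical.

import numpy as np

def start_free_l1(pred, target):
    """L1 distance between two polygons (N x 2 vertex arrays) that ignores
    which vertex is labelled as the starting point: take the minimum over all
    cyclic shifts of the target vertex order."""
    n = target.shape[0]
    losses = [np.abs(pred - np.roll(target, k, axis=0)).mean() for k in range(n)]
    return min(losses)

# Toy example: the same quadrilateral annotated with two different start vertices.
target = np.array([[0, 0], [10, 0], [10, 5], [0, 5]], dtype=float)
pred = np.roll(target, 1, axis=0) + 0.1                  # shifted start, tiny offset
print(start_free_l1(pred, target))                       # ~0.1, not a large mismatch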
Article
The traditional picture retrieval system has a slow retrieval speed, poor retrieval accuracy, and a low recall when performing massive picture retrieval. In this paper, we design a massive picture retrieval system using big data image mining technology. It is constructed with a data processing layer, a business logic layer and a presentation layer, and works through the three steps of data segmentation, mining and merging. For instance, it runs the distributed file system module in a Master/Slave operation mode and designs file read and write requests according to user interaction. Next, it performs parallel computing over picture data sets based on the MapReduce module to solve the picture matching and similarity metrics and returns the sorted picture matching results to the user. Then, it extracts the color and texture features of the target area to generate the final picture retrieval result. We select a large number of pictures on a big data platform as the simulation test set. The results show that the system we designed has good retrieval accuracy and a high retrieval speed, which greatly improves the recall of picture retrieval.
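The colour-feature matching step of such a retrieval pipeline can be sketched with a joint RGB histogram and histogram-intersection similarity; the images below are random stand-ins, and the MapReduce/distributed parts are omitted.

import numpy as np

def color_histogram(image, bins=8):
    """Normalised joint RGB histogram used as a simple colour feature."""
    hist, _ = np.histogramdd(image.reshape(-1, 3), bins=(bins,) * 3,
                             range=[(0, 256)] * 3)
    return (hist / hist.sum()).ravel()

def retrieve(query, database, top_k=3):
    """Rank database images by histogram-intersection similarity to the query."""
    q = color_histogram(query)
    scores = [np.minimum(q, color_histogram(img)).sum() for img in database]
    return np.argsort(scores)[::-1][:top_k]

rng = np.random.default_rng(0)
database = [rng.integers(0, 256, (64, 64, 3)) for _ in range(5)]
print(retrieve(database[2], database))                   # index 2 should rank first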
Article
The detection of text in an image and identification of its language are important tasks in optical character recognition. Such tasks are challenging, particularly in natural scene images. Previous studies have been conducted with a focus on convolutional neural networks for script identification. In other studies, fully convolutional networks (FCNs) have been used for model enhancement and not as classifiers. In this study, we use FCNs for both model enhancement and classification. The proposed methodology improves the Efficient and Accurate Scene Text Detector by adding new FCN branches for script identification. Moreover, whereas most end-to-end (e2e) methods train the text detection and script identification models separately, we propose two e2e methods for jointly training the models, namely, multi-channel mask (MCM) and multi-channel segmentation (MCS). The results show that the performance of an MCM is similar to that of other state-of-the-art methods, whereas MCS outperforms existing methods with recall values of 54.34% and 81.13%, when using the ICDAR MLT 2017 and MLe2e datasets, respectively.
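A script-identification branch of an FCN amounts to a per-pixel classifier over a shared feature map; the numpy sketch below mimics a 1x1-convolution branch with a softmax over hypothetical script classes (shapes and weights are made up, and this is not the proposed MCM/MCS architecture).

import numpy as np

def script_branch(feature_map, W, b):
    """A 1x1-convolution style classification branch: every spatial position of
    the shared feature map gets a softmax over script classes."""
    logits = feature_map @ W + b                         # (H, W, n_scripts)
    logits -= logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs.argmax(axis=-1), probs

# Toy shapes: a 32x32 feature map with 64 channels and 4 candidate scripts.
feat = np.random.randn(32, 32, 64)
W = np.random.randn(64, 4) * 0.01
b = np.zeros(4)
script_map, probs = script_branch(feat, W, b)
print(script_map.shape, probs.shape)                     # (32, 32) (32, 32, 4)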
Article
In this paper, an end-to-end multi-task deep neural network was proposed for simultaneous script identification and Keyword Spotting (KWS) in multi-lingual handwritten and printed document images. We introduced a unified approach which addresses both challenges cohesively by designing a novel CNN-BLSTM architecture. The script identification stage involves local and global feature extraction to allow the network to cover more relevant information. Contrary to traditional feature fusion approaches, which build a linear feature concatenation, we employed a compact bilinear pooling to capture pairwise correlations between these features. The script identification result is then injected into the KWS module to eliminate characters of irrelevant scripts and perform the decoding stage in a single-script mode. All the network parameters were trained in an end-to-end fashion using multi-task learning that jointly minimizes the NLL loss for script identification and the CTC loss for the KWS. Our approach was evaluated on a variety of public datasets of different languages and writing types. Experiments proved the efficacy of our deep multi-task representation learning compared to the state-of-the-art systems for both keyword spotting and script identification tasks.
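Compact bilinear pooling is commonly realized with the Tensor Sketch trick: count-sketch both feature vectors and combine them by circular convolution in the Fourier domain. The numpy sketch below follows that generic recipe on hypothetical local/global feature vectors; it is an illustration of the technique, not the authors' exact module.

import numpy as np

def count_sketch(v, h, s, d):
    """Project vector v to d dimensions with random hash h and signs s."""
    out = np.zeros(d)
    np.add.at(out, h, s * v)
    return out

def compact_bilinear(x, y, d=512, seed=0):
    """Tensor-sketch approximation of the outer product of x and y: sketch both
    vectors, then do circular convolution in the Fourier domain."""
    rng = np.random.default_rng(seed)
    hx, hy = rng.integers(0, d, x.size), rng.integers(0, d, y.size)
    sx, sy = rng.choice([-1.0, 1.0], x.size), rng.choice([-1.0, 1.0], y.size)
    fx = np.fft.rfft(count_sketch(x, hx, sx, d))
    fy = np.fft.rfft(count_sketch(y, hy, sy, d))
    return np.fft.irfft(fx * fy, n=d)

# Fuse hypothetical local and global script features into one compact vector.
local_feat, global_feat = np.random.randn(256), np.random.randn(256)
print(compact_bilinear(local_feat, global_feat).shape)   # (512,)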
Article
In this work, we address the task of scene text retrieval: given a text query, the system returns all images containing the queried text. The proposed model uses a single shot CNN architecture that predicts bounding boxes and builds a compact representation of spotted words. In this way, this problem can be modeled as a nearest neighbor search of the textual representation of a query over the outputs of the CNN collected from the totality of an image database. Our experiments demonstrate that the proposed model outperforms previous state-of-the-art, while offering a significant increase in processing speed and unmatched expressiveness with samples never seen at training time. Several experiments to assess the generalization capability of the model are conducted in a multilingual dataset, as well as an application of real-time text spotting in videos.
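The retrieval step described here reduces to a nearest-neighbour search between a query embedding and the word representations collected from the image database; a minimal cosine-similarity sketch, with hypothetical 128-dimensional embeddings and image ids, is given below.

import numpy as np

def cosine_retrieval(query_vec, db_vecs, db_meta, top_k=5):
    """Return the images whose spotted-word embeddings are closest to the query."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    D = db_vecs / (np.linalg.norm(db_vecs, axis=1, keepdims=True) + 1e-12)
    scores = D @ q
    order = np.argsort(scores)[::-1][:top_k]
    return [(db_meta[i], float(scores[i])) for i in order]

# Hypothetical database: one embedding per spotted word, tagged with its image id.
rng = np.random.default_rng(0)
db_vecs = rng.standard_normal((1000, 128))
db_meta = [f"image_{i % 200}.jpg" for i in range(1000)]
query = db_vecs[42] + 0.05 * rng.standard_normal(128)    # noisy copy of entry 42
print(cosine_retrieval(query, db_vecs, db_meta, top_k=3))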
Article
We introduce a new arbitrary-shaped text detection approach named ReLaText by formulating text detection as a visual relationship detection problem. To demonstrate the effectiveness of this new formulation, we start by using a “link” relationship to address the challenging text-line grouping problem. The key idea is to decompose text detection into two subproblems, namely detection of text primitives and prediction of link relationships between nearby text primitive pairs. Specifically, an anchor-free region proposal network based text detector is first used to detect text primitives of different scales from different feature maps of a feature pyramid network, from which a text primitive graph is constructed by linking each pair of nearby text primitives detected from the same feature map with an edge. Then, a Graph Convolutional Network (GCN) based link relationship prediction module is used to prune wrongly-linked edges in the text primitive graph to generate a number of disjoint subgraphs, each representing a detected text instance. As the GCN can effectively leverage context information to improve link prediction accuracy, our GCN based text-line grouping approach can achieve better text detection accuracy than previous text-line grouping methods, especially when dealing with text instances with large inter-character or very small inter-line spacing. Consequently, the proposed ReLaText achieves state-of-the-art performance on five public text detection benchmarks, namely RCTW-17, MSRA-TD500, Total-Text, CTW1500 and DAST1500.
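The edge-pruning and grouping stage can be sketched independently of the GCN: keep only links whose predicted score passes a threshold and take the connected components of the remaining graph as text instances. The union-find sketch below uses hypothetical primitive indices and link scores.

# Hypothetical link scores between text-primitive pairs (e.g., from a GCN).
edges = {(0, 1): 0.95, (1, 2): 0.90, (2, 3): 0.10, (3, 4): 0.88}
n_primitives, threshold = 5, 0.5

parent = list(range(n_primitives))          # union-find over primitives

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]       # path compression
        i = parent[i]
    return i

for (a, b), score in edges.items():
    if score >= threshold:                  # keep only confidently linked pairs
        parent[find(a)] = find(b)

groups = {}
for i in range(n_primitives):
    groups.setdefault(find(i), []).append(i)
print(list(groups.values()))                # [[0, 1, 2], [3, 4]]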
Article
An intelligent transportation system facilitates smart services and applications that can revolutionize the traffic and travel experience. A driver assistance system is a crucial part of such a system that helps to improve the safety and security of passengers by mitigating on-road collisions and potential hazards. The precise sensing (localization) and spotting of scene texts and traffic signs are important for achieving higher performance in real time. It is, however, affected by motion blur and camera shake noise, which makes the process of spotting complex. In this paper, we propose a robust text spotter, denoted Blurred TextSpotter, for efficient and cost-effective spotting in blurry scene images. We address different noises, like motion blur, Gaussian blur, camera shake noise, and interclass interference. We apply a multi-scale contextual-information-enriched encoder-decoder based backbone network followed by spatial and channel-wise attention. We predict text masks and accurately classify words using a hardware-efficient recognition module. The experimental results on five publicly available benchmark datasets show the efficiency of the proposed text spotter in terms of detection, recognition, and spotting of curved text instances in scene images.
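Channel-wise attention is commonly implemented in squeeze-and-excitation style: pool each channel globally, pass the result through a small bottleneck, and rescale the channels. The numpy sketch below illustrates that general pattern with hypothetical shapes; it is not the paper's exact attention module.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, W1, W2):
    """Squeeze-and-excitation style channel attention: global-average-pool each
    channel, pass through a small bottleneck MLP, and rescale the channels."""
    squeeze = feat.mean(axis=(0, 1))                     # (C,)
    excite = sigmoid(np.maximum(squeeze @ W1, 0.0) @ W2) # (C,) weights in (0, 1)
    return feat * excite[None, None, :]

# Toy feature map with 32 channels and a reduction ratio of 4 in the bottleneck.
feat = np.random.randn(20, 20, 32)
W1 = np.random.randn(32, 8) * 0.1
W2 = np.random.randn(8, 32) * 0.1
print(channel_attention(feat, W1, W2).shape)             # (20, 20, 32)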