ArticlePDF Available

A Computer Vision-Based Object Localization Model for Endangered Wildlife Detection

Authors:
A computer vision-based object localization model for endangered wildlife
detection
Arunabha M. Roy1, Jayabrata Bhaduri2, Teerath Kumar3, and Kislay Raj3
1Aerospace Engineering Department, University of Michigan, Ann Arbor, MI
48109, USA
2
Capacloud AI, Deep Learning &Data Science Division, Kolkata, WB 711103,
India.
3School of Computing, Dublin City University, Dublin 9, Ireland
Abstract
Objective. With climatic instability, various ecological disturbances, and human actions threaten
the existence of various endangered wildlife species. Therefore, an up-to-date accurate and
detailed detection process plays an important role in protecting biodiversity losses, conservation,
and ecosystem management. Current state-of-the-art wildlife detection models, however, often
lack superior feature extraction capability in complex environments, limiting the development
of accurate and reliable detection models. Method. To this end, we present WilDect-YOLO, a
deep learning (DL)-based automated high-performance detection model for real-time endangered
wildlife detection. In the model, we introduce a residual block in the CSPDarknet53 backbone
for strong and discriminating deep spatial features extraction and integrate DenseNet blocks to
improve in preserving critical feature information. To enhance receptive field representation,
preserve fine-grain localized information, and improve feature fusion, a Spatial Pyramid Pooling
Corresponding author, 4/09/2022
1
2
(SPP) and modified Path Aggregation Network (PANet) have been implemented that results
in superior detection under various challenging environments. Results. Evaluating the model
performance in a custom endangered wildlife dataset considering high variability and complex
backgrounds, WilDect-YOLO obtains a mean average precision (mAP) value of 96
.
89%, F1-score
of 97
.
87%, and precision value of 97
.
18% at a detection rate of 59.20 FPS outperforming current
state-of-the-art models. Significance. The present research provides an effective and efficient
detection framework addressing the shortcoming of existing DL-based wildlife detection models
by providing highly accurate species-level localized bounding box prediction. Current work
constitutes a step towards a non-invasive, fully automated animal observation system in
real-time in-field applications.
Keywords: Endangered wildlife detection; You Only Look Once (YOLOv4) algorithm; Object
Detection (OD); Computer vision; Deep Learning (DL); Wildlife Preservation
1. Introduction :
In recent years, automated wildlife detection plays a critical role in wildlife survey (Peng
et al.,2020;Chalmers et al.,2021;Delplanque et al.,2021), conservation (Khaemba and Stein,
2002;O’Brien,2010), and ecosystem management (Austrheim et al.,2014;Harris et al.,2010)
to tackle worldwide accelerated biodiversity crisis. Up-to-date detailed and accurate wildlife
data can be beneficial in preventing biodiversity losses, ecosystem damage, and poaching
(Norouzzadeh et al.,2018;Petso et al.,2021). While traditional wildlife survey techniques
mainly include distance sampling (Aebischer et al.,2017), camera trapping (Chauvenet et al.,
2017), and satellite monitoring (Chauvenet et al.,2017), however, such traditional techniques
have disadvantages due to lower efficiency, high cost, the requirement of qualified personals, and
their individual bias (Guo et al.,2018). Similarly, wild animal surveys with aerial image object
detection generally suffer from low accuracy due to complex backgrounds and disturbances
among wild animals (Eikelboom et al.,2019). Moreover, satellite-based monitoring methods
require very-high-resolution satellite imagery which are limited for relatively larger-sized animals
(Wang et al.,2019).
3
To circumvent such issues, various automatic and semi-automatic detection algorithms for
wildlife animals have been adopted, in particular, from unmanned aircraft systems (UASs)
imagery (Gonzalez et al.,2016;Ofli et al.,2016). Additionally, pixel-based classification methods
that include threshold setting, supervised, and unsupervised classification have been popular
methods for detecting animals in remote sensing images (Pringle et al.,2009;Kudo et al.,2012).
However, these methods are not adequate for detecting targets with similar gray-scale values
with the complex background (Wang et al.,2019). To detect targets in complex environments,
various machine learning (ML) methods have been employed to localize objects combining
rotation-invariant object descriptors for automated wildlife detection (Cheng and Han,2016).
Although, traditional ML yields encouraging results in relatively simple scenarios, however,
they are not adequate and robust methods for detecting complicated animal features such as
structure, texture, morphology, etc (Rey et al.,2017;Peng et al.,2020).
More recently, driven by big-data methods (Khan et al.,2022a), deep learning (DL)
characterized by multilayer neural networks (NN) (LeCun et al.,2015) has shown remarkable
breakthroughs in pattern recognition for various fields including image classification (Rawat
and Wang,2017;Jamil et al.,2022;Khan et al.,2022b), computer vision (Voulodimos et al.,
2018;Chandio et al.,2022), object detection (Zhao et al.,2019a;Roy and Bhaduri,2021;Roy
et al.,2022;Roy and Bhaduri,2022), time-series classification (Xiao et al.,2021a,c;Xing et al.,
2022a,b), brain-computer interface (Roy,2022b,a,c), and across diverse scientific disciplines
(Zhu et al.,2017;Roy,2021;Bose and Roy,2022). Particularly in object localization, DL
methods have demonstrated superior accuracy (Han et al.,2018) that can be categorized into
two classes: two-stage and one-stage detector (Lin et al.,2017a). Two-stage detectors including
Region Convolution Neural Network (RCNN) (Girshick,2015), faster-RCNN (Ren et al.,2016),
mask-RCNN (He et al.,2017) etc have shown a significant improvement in accuracy in object
localization. In recent times, You Only Look Once (YOLO) variants (Redmon et al.,2016;
Redmon and Farhadi,2017,2018;Bochkovskiy et al.,2020) have been proposed that unify
target classification and localization leading to significant improvement in the detection speed
(Roy et al.,2022;Roy and Bhaduri,2022,2021). Therefore, driven by advances in computer
vision technologies, wildlife detection is rapidly transforming into a data-rich discipline and has
been applied in the automated detection of a variety of wildlife species (Eikelboom et al.,2019;
4
Gon¸calves et al.,2020;Duporge et al.,2021). Along the similar line, various DL methodologies
such as convolutional neural network (CNN) (Kellenberger et al.,2018), RetinaNet (Eikelboom
et al.,2019), ResNet-50 (Chabot et al.,2022), YOLOv3 (Torney et al.,2019), Faster R-CNN
(Peng et al.,2020), Libra-RCNN (Delplanque et al.,2021) etc have demonstrated high precision
in object localization and can be deployed as a reliable and predictable model for automated
wildlife detection.
Motivations : The main motivation of the present study is to design an efficient and robust
computer vision-based algorithm for the accurate classification and localization of endangered
wildlife species. Climatic instability and various human activities such as thawing, hunting,
oil drilling, etc threaten the existence of various endangered animals and create damage to
ecosystems (Jask´olski,2021). Species that inhabit such ecosystems are highly specialized to
live in adverse weather conditions, which is why such changes affect them severely (Crooks
et al.,2017). Thus, it is crucial to build an accurate automated endangered wildlife detection
model to conserve and protect the species and the ecosystem. Although, there exists several
state-of-the-art works for wildlife detection (Barbedo et al.,2019;Naude and Joubert,2019;
Peng et al.,2020;Moreni et al.,2021) including multi-species animal detection (Eikelboom et al.,
2019;Delplanque et al.,2021), however, they often suffer from low accuracy, missed detection,
and relatively large computational overhead. Additionally, there is no systemic study, as per
the authors’ best knowledge, that addresses the challenge of detecting and accurate localization
of multiple endangered wildlife species that is worthy of further investigation. To this end,
the current works aim to develop an efficient and robust endangered wildlife classification and
accurate object localization model simultaneously productive in terms of training time and
computational cost which is currently missing in recent state-of-the-art models for endangered
wildlife detection.
Challenges : Despite illustrating outstanding performance in detecting wildlife species, current
state-of-the-art DL algorithms are still not suitable due to their insufficient fine-grain feature
extraction capability leading to missed detection and false object predictions for endangered
species which posses unique body textures, shapes, sizes, and colors (Kim et al.,2019). Between
5
various species, accurate detection and localization tasks can be challenging due to significant
variability of lightening conditions, low visibility, high degree of osculation and overlap,
the coexistence of multi-object classes with various aspect ratios, and other morphological
characteristics (Chabot et al.,2019). Additionally, visual similarities, complex background
and the low distinguishable interface between species and their surroundings, and various
other critical factors offer additional challenges and difficulties for the state-of-the-art wildlife
detection models (Feng and Li,2022).
To address the aforementioned shortcomings, in the current study, we present WilDect-YOLO,
based on an improved version of the state-of-art YOLOv4 detection model for accurate real-time
endangered wildlife detection. In WilDect-YOLO, we integrate DenseNet blocks to improve
preserving critical feature information and reuse. In addition, two residual blocks have been
carefully designed in the CSPDarknet53 backbone for strong and discriminating deep spatial
features extraction. Furthermore, Spatial Pyramid Pooling (SPP) has been tightly attached
to the backbone to enhance the representation of receptive fields. We have also utilized
a modified Path Aggregation Network (PANet) to efficiently preserve fine-grain localized
information by feature fusion. Additionally, we performed an extensive ablation study for
backbone-neck architecture to optimize both accuracy of detection and detection speed. The
proposed WilDect-YOLO has been employed to detect distinct eight different endangered wildlife
species that provide superior and accurate detection under various complex and challenging
environments. The WilDect-YOLO effectively addresses the shortcoming of existing DL-based
wildlife detection models and illustrates the superior potential in real-time in-field applications.
In short, current work constitutes a step toward a non-invasive, fully automated efficient animal
observation system.
2. Related Works :
In the present section, some recent and relevant works have been highlighted. More recently, a
two-channeled perceiving residual pyramid network (Ruff et al.,2021) has been proposed based
on audio signals that deliver superior detection accuracy. Furthermore, different techniques
such as segmentation-based YOLO model (Parham et al.,2018), fast-depth CNN-based
6
detection model from highly cluttered camera images (Singh et al.,2020), sparse multi
discriminative-neural network (SMD-NN) (Meena and Loganathan,2020), a fast image-enhancement
algorithm based on Multi-Scale Retinex (MSR) (Zotin and Proskurin,2019), CNN-based
model for facial detection (Taheri and
¨
Onsen Toygar,2018), a semi-supervised learning-based
Multi-part CNN (MP-CNN) (Divya Meena and Agilandeeswari,2019), CNN with k-Nearest
Neighbor (kNN) has been utilized for wildlife detection that provides state-of-the-art performance.
In terms of endangered animal detection, there is only a handful of work that has been
geared toward addressing such an important issue. Notably, the DL-based model for classifying
red pandas (He et al.,2019); animal action recognition based on wildlife videos (Schindler and
Steinhage,2021) are some of the representative works in recent endeavors. Additionally, RGB
and thermal image-based Arctic bird detection using drones has been developed in (Lee et al.,
2019). After reviewing the aforementioned methods which are geared towards endangered
wildlife detection, the current works aim to develop an efficient and robust endangered wildlife
classification and accurate object localization model simultaneously productive in terms of
training time and computational cost which is currently lacking in the recent state-of-the-art
endeavors.
3. Endangered wildlife species dataset :
Since there is no publicly available endangered wildlife dataset, in the present work, we
have extensively collected high-resolution web-harvested images for different endangered species
under various complex backgrounds. The dataset used for the experimentation comprises
eight classes: Polar Bear (Ursus maritimus) , Gal´apagos Penguin (Spheniscus mendiculus),
Giant Panda (Ailuropoda melanoleuca), Red Panda (Ailurus fulgens), African forest elephant
(Loxodonta cyclotis), Sunda Tiger (Panthera tigris sondaica), Black Rhino (Diceros bicornis),
and African wild Dog (Lycaon pictus). Fig. 1shows some of the representative images from
the custom dataset for the eight different classes considered herein. Noteworthy to mention,
categories including Gal´apagos Penguin, Red Panda, African forest elephant, Sunda Tiger,
Black Rhino and African wild Dogs have been declared critically endangered species. In the
datasets, there are a total number of 1600 images of which there are 200 images for each class.
7
Figure 1: (a) Representative samples images from endangered wildlife dataset that consist of
eight classes: (a) Polar Bear; (b) Gal´apagos Penguin; (c) Giant Panda; (d) Red Panda; (e)
African forest elephant; (f) Sunda Tiger; (g) Black Rhino; and (h) African wild Dog
For the variability and challenges in the datasets, we have included images that characterize
limited and/or full illumination, low visibility, high degree of occultation, multiple objects
with overlap, complex backgrounds, the textural similarity of the object and the background,
and noisy environment. Additionally, the images of the dataset have variations in their scale,
orientation, and resolution.
4. Proposed Methodology for object localization:
In object detection, the target object classification and localization are performed simultaneously
where the target class has been categorized and separated from the background by drawing
bounding boxes (BBs) on input images containing the entire object. This can be particularly
useful for counting endangered species for accurate surveying. To this end, the main goal of the
current work is to develop an accurate and robust endangered wildlife localization model. In
this regard, different variants of YOLO (Redmon et al.,2016;Redmon and Farhadi,2017,2018;
8
η
d
wp
hp
wgt
hgt
(a)
Wildlife detection
Input N ×N grids
BBs+ confidence score
Class probability
Figure 2: Schematic of (a) YOLO object localization process for endangered wildlife detection;
(b) offset regression process for target BBs prediction during CIoU loss.
Bochkovskiy et al.,2020) are some of the best high-precision one-stage object detection models
that consist of the following parts: a backbone for semantic deep feature extraction, followed by
the neck for hierarchical feature fusion, and finally detection head for object classification and
localization. The overall schematic of the YOLO object localization process has been depicted
in Fig. 2where the YOLO algorithm transforms the object detection task into a regression
problem by generating BBs coordinates and probabilities for each class. During the process,
the inputted image size has been uniformly divided into
N×N
grids where
B
predictive BBs
have been generated. Subsequently, a confidence score has been assigned if the target object
falls inside that particular grid. It detects the target object for a particular class when the
center of the ground truth lies inside a specified grid. During detection, each grid predicts
NB
numbers of BBs with the confidence value ΘBas:
ΘB=Pr(obj)×IoUt
p Pr(obj)0,1 (1)
where
Pr
(
obj
) infers the accuracy of BB prediction, i.e.,
Pr
(
obj
) = 1 indicates that the target
class falls inside the grid, otherwise,
Pr
(
obj
) = 0. The degree of overlap between ground truth
and the predicted BB has been described by the scale-invariant evaluation metric intersection
9
over union (IoU) which can be expressed as
IoU = BpBt
BpBt
(2)
where B
t
and B
p
are the ground truth and predicted BBs, respectively. However, to further
improve BBs regression and gradient disappearance, generalized IoU (GIoU) (Rezatofighi et al.,
2019) and distance-IoU (DIoU) (Zheng et al.,2020) as been introduced considering aspect ratios
and orientation of the overlapping BBs. More recently, complete IoU (CIoU) (Zheng et al.,
2020) has been proposed for improved accuracy and faster convergence speed in BB prediction
which can be expressed as
LCIoU = 1 + βξ +α2(bp,bt)
η2IoU (3)
ξ=4
π2tan1wt
ht
tan1wp
hp2
;β=ξ
(1 IoU) + ξ0(4)
where
bgt
and b
p
denotes the centroids of B
gt
and B
p
, respectively;
ξ
and
β
are the consistency
and trade-off parameters, respectively. As shown in Fig. 2-(b),
η
is the smallest diagonal
length of B
p
B
t
;
wgt
,
wp
are widths and
hgt
,
hp
are heights of B
gt
and B
p
, respectively.
With increasing
wp/hp
, we get
ξ
0 from Eq. 4. Therefore, to optimize the influence of
ξ
on
the CIoU,
wp/hp
can be properly chosen for the YOLO model. Finally, the best BB prediction
can be obtained from the non-maximum suppression (NMS) (Ren et al.,2016) algorithm from
multiple scales.
4.1 WilDect-YOLO architecture:
In recent endeavors, various attempts have been made on computer vision-based object detection
algorithm for accurate wildlife detection and survey utilizing deep CNN (Kellenberger et al.,
2019), R-CNN (Ibraheam et al.,2021), Faster R-CNN (Peng et al.,2020), single shot multi-box
detector (SSD) (Saxena et al.,2021), and YOLO (Choe and Kim,2020). Although the
aforementioned techniques have demonstrated outstanding performance, however, the detection
of endangered wildlife detection task, specifically in Polar and African regions, faces several
10
specific challenges, in particular, significant variability of lightening conditions, low visibility,
high degree of osculation and overlap, the coexistence of multiple target classes with various
aspect ratios, visual similarities, complex backgrounds, and the low distinguishable interface
between species and its surroundings. Such challenging conditions lead to false object prediction
with a large number of missed detection from the original YOLOv4 (Bochkovskiy et al.,2020)
due to its insufficient fine-grain feature extraction capabilities.
To resolve the existing issues, in the current work, we propose a novel object localization
algorithm WilDect-YOLO based on a state-of-the-art YOLOv4 network, specially designed
for endangered wildlife detection, to enhance feature extraction, preserve fine-grain localized
information and improve feature fusion that provides superior detection under various challenging
environments. The model has been optimized to achieve better efficiency and accuracy of BB
prediction based on the characteristics and complexities of the endangered wildlife dataset
considered herein. The overall network of the object localization model is shown in Fig.
3. To improve performance in terms of classification accuracy and object localization, we
Input: (416, 416, 3)
Down sample:
Up sample:
Concatenate:
DS
13×13×24
26×26×24
52×52×24
Detection
US
Class Loss
CIoU Loss
Confidence Loss
C
CSPX2×3
C
CSPX2-3
Dense-CSPDarknet53
CSPX2×3
Modified PANet
Head
DS
DS
CSP1
CSP2
CSP8
D-CSPX1-4
D-CSPX1-2
CBH
Conv2D
CBH
Conv2D
CBH
CBH
C
CBH
C
CBH
US
CBH
CBH
US
C
CBH
Conv2D
CBH
CSP1
CSP2
CSP4
CSPX1-3
CSPX1×3
Dense B-2
CSPX1-2
CSPX1-4
CSPX2-3
CSPX2×3
CSPX2-3
CSPX2×3
C
52×52×24
26×26×24
CSPX2-3
CBL
SPP
MaxPool (5)
MaxPool (9)
MaxPool (13)
CBH
Dense B-1
CBH
CSP8
CSP2
CSP1
Figure 3: Schematic of the proposed WilDect-YOLO consists of improved Dense-CSPDarknet53
with residual block CSPX1-
n
and SPP in the backbone, modified PANet in the neck part with
regular YOLO head.
11
perform extensive experiments, and various modifications are proposed which are detailed in
the subsequent sections.
4.2 Improvement of discriminative feature extraction:
In the present study, we have introduced a residual block CSPX1-
n
where
n
represents
residual weighting operations to improve detection speed and performance. We integrate
CSPX1-
n
modules in the CSPDarknet53 backbone replacing the original CSP8 and CSP4
residual blocks to extract fine-grained rich semantic information as shown in Fig. 3. In the
CSPX1-
n
block, we divide the input features into two parts. In the first part, (3
×
3) convolution
was performed followed by an additional (3
×
3) convolution to maintain the number of feature
maps after entering the next residual unit as shown in Fig. 4-(a). To further improve the
feature extraction, we perform 3
×
3 convolution at the end. Whereas, the second part acts as
a residual edge for the convolution. These two parts have been concatenated at the end to
improve the semantic feature information. Implementation of the CSPX1-
n
modules in the
improved CSPDarknet53 helps to learn more expressive features that demonstrate significant
improvement of detection accuracy for the custom wildlife datasets used herein.
4.3 Preserving critical feature information:
To preserve critical feature maps and efficiently reuse the discriminative feature information,
we have fused DenseNet (Huang et al.,2017) in the original CSPDarknet53. In DenseNet,
each layer has been connected to other layers in a feed-forward mode where
n
-th layer can
receive the important feature information
Xn
from all the previous layers
X0, X1, ..., Xn1
as
Xn
=
Hn
[
X0, X1, ..., Xn1
] where
Hn
is the feature map function for
n
-th layer. The schematic
of the DenseNet blocks network structure have been shown in Fig. 4-(b, c). As shown in Fig. 3,
we have introduced two DenseNet blocks; the first block (Dense B-1) has been attached before
cross-stage partial block CSPX1-4; whereas the second block (Dense B-2) has been placed
before CSPX1-2 in the proposed WilDect-YOLO network which results in enhance feature
propagation. It has been found that DenseNet significantly improves the feature transfer and
12
26×26×24
Res Unit
CSPX2-n
(Res Unit) ×n
Part I
Part II
CSPX2-n Block
C
CBH
BN
L-ReLU
Conv2D
Conv2D
X1
H1
Input: (26×26×256)
Output:
( 26×26×512)
X0
X2
X3
X4
H2
H3
H4
X1
Transition
Layer
( 26×26×320)
( 26×26×384)
( 26×26×448)
Transition
Layer
Output:
( 13×13×1024)
( 13×13×896)
( 13×13×768)
Input: (13×13×512)
( 13×13×640)
X4
X3
X2
X1
X0
H4
H3
H2
H1
Dense Block -1
Dense Block -2
(a)
(b)
Res Unit
CSPX1-n
(CBH) ×3
Part I
Part II
CSPX1-n Block
C
CBH
CBH
Conv2D
Conv2D
BN
L-ReLU
CBH
(b)
(c)
CBH
(d)
Figure 4: Schematic of (a) CSPX1-
n
residual block; (b) dense block (DB)-1; (c) dense block
(DB)-2; (d) CSPX2-nresidual block architecture used in WilDect-YOLO detection model.
13
mitigates over-fitting in the proposed detection network. Additionally, by reducing redundant
feature operations, such implementation improve the computational speed.
4.4 Receptive field enhancement:
One of the requirements of CNN is to have fixed-size input images. However, due to the
different aspect ratios of the images, they have been fixed by cropping and warping during the
convolution process which results in losing important features. In this regard, SPP (He et al.,
2015) applies an efficient strategy in detecting target objects at multiple length scales. To
this end, we have added an SPP block integrated with CSPX1-2 of the Dense-CSPDarknet53
backbone to improve receptive field representation and extraction of important contextual
features as shown in Fig. 4. In the proposed model, a modified SPP consisting of various sizes
of sliding kernels (i.e., 5
×
5, 9
×
9, and 13
×
13 ) with maximum pooling has been prescribed
that effectively increases the receptive field representation of the backbone.
4.5 Preserving fine-grain localize information:
In addition, an improved PANet (Liu et al.,2018) integrated with CSPX2-
n
has been utilized
as a neck of the detection model as shown in Fig. 2. It can efficiently combine high and low
feature fusion for multi-scale feature pyramid maps preserving fine-grain localized information.
Additionally, by employing flexible ROI pooling and element-wise max operation, PANet can
efficiently fuse the information from previous feature layers resulting in significant improvement
in the detection accuracy of the model.
Furthermore, CIoU loss function (Zheng et al.,2020), dropblock regularization (Ghiasi et al.,
2018), Cross mini Batch Normalization (Yao et al.,2021), dropout in feature map (Srivastava
et al.,2014), and cosine annealing scheduler (Loshchilov and Hutter,2017) have been employed
to further improve the performance of WilDect-YOLO. We use the original YOLOv3 head in
the final part of the detection network. Utilizing 416
×
416
×
3 image size as the input, the
detection head of the WilDect-YOLO can predict BBs in three different scales: (13
×
13
×
24),
(26
×
26
×
24), and (52
×
52
×
24) as shown in Fig. 2. After extensive experiments, we have
14
found that Mish (Misra,2020) activation provides the optimal performance in terms of model
accuracy. Overall, our proposed methodology provides the best results in terms of accuracy
and performance compared to current state-of-the-art models for endangered wildlife detection
(see Section 6.2 )
5. Training and performance :
5.1 Training procedure :
In the present work, we have performed an extensive and elaborate study to explore the
comparative performance analysis of the proposed WilDect-YOLO models for endangered
wildlife classification and object localization. From the initial custom endangered wildlife
species dataset consisting of 1,600 images has been further expanded tenfold by utilizing various
data augmentation procedures (i.e., color balancing, rotation, blur processing, mirror projection,
brightness transformation) to obtain the final dataset of a total of 16,000 images (2,000 images
per class). From the final dataset, a total of 60%, 20%, and 20% images have been randomly
chosen for training, validation, and test sets, respectively. For the training set, LabelImg
(Tzutalin,2015) has been used for the annotation of BBs around the target classes. For all
the experiments, we have used a Windows 10 Pro (64-bit) based computational system that
has Intel Core i5-10210U with CPU @ 2.8 GHz
×
6, 32 GB DDR4 memory, NVIDIA GeForce
RTX 2080 utilizing CUDA 10.2.89 and cuDNN 10.2 v7.6.5 for GPU parallelization. As required
CV libraries, Visual Studio v15.9 (2017), and OpenCV 4.5.1-vc14 have been integrated with
DarkNet. Unless otherwise stated, a batch size set to 32 with a total number of training steps
has been kept as 85,000 during training. The initial learning rate has been set to 0.001. The
training dataset has been trained utilizing the available pre-trained weights-file (AlexeyAB,
2021). Various training hyperparameters for WilDect-YOLO have been detailed in Table 1.
5.2 Performance metrics:
In the present work, the performance of the object detection models has been evaluated
15
Table 1: Various hyparameters values for training the WilDect-YOLOv model
Image size Sub-division Batch Channels Decay
416 ×416 ×3 8 32 6 0.005
Initial learning rate Momentum Classes Training steps Filters
0.001 0.9 8 85,000 36
by common standard measures (Ferri et al.,2009) including average precision (AP), precision
(P), recall (R), IoU, F-1 score, mean average precision (mAP), etc. The confusion matrix
obtained from the evaluation procedure provides the following interpretations of the test results:
true positive (TP), false positive (FP), false negative (FN), and true negative (TN). During
binary classification, the classified object can be defined as TP for IoU
0
.
5. Whereas, it can
be classified as FP for IoU
<
0
.
5. Based on the aforementioned interpretations, the metric P of
the classifier can be defined by its ability to distinguish target classes correctly as :
P=T P
(T P +F P ); (5)
The ratio of the correct prediction of target classes is called R of the classifier which can be
evaluated as:
R=T P
(T P +F N )(6)
The higher values of P and R indicate superior detection capability. Whereas, the F-1 score is
the arithmetic mean of the P and R given as :
F1score = 2P×R
P+R.(7)
A relatively high F1 score represents a robust detection model. The performance metrics AP
can be defined as the area under a P-R curve (Davis and Goadrich,2006) as follows
AP =Z1
0
P(R) dR. (8)
16
A higher average AP value indicates better accuracy in predicting various object classes. In
addition,
AP50:95
denotes AP over IoU=0
.
50 : 0
.
05 : 0
.
95; AP
50
and AP
75
are APs at IoU
threshold of 50% and 75%, respectively. The AP for detecting small, medium, and large objects
can be measured through AP
S
, AP
M
, and AP
L
, respectively. Finally, mAP can be obtained
from the average of all APs as:
mAP =1
Nc
N
X
i=1
APi.(9)
6. Results:
In this section, the performance and detection accuracy of the proposed WilDect-YOLO
frameworks have been discussed which have been evaluated in a custom-made endangered
wildlife dataset consisting of 8 classes. For better clarity in BBs representation, the following
BB class identifiers have been associated in the detection results: class 1- Polar Bear; class 2-
Gal´apagos Penguin; class 3- Giant Panda; class 4- Red Panda; class 5- African forest elephant;
class 6- Sunda Tiger; class 7- Black Rhino; and class 8- African wild Dog. The performance of
the WilDect-YOLO network has been optimized through extensive ablation studies. Finally,
the performance of the proposed model has been studied in detail and compared with several
state-of-the-art object detection models.
6.1 Optimization of network performance:
At first, we conduct extensive experiments to select proper backbone-neck combinations
to optimize the performance of the proposed WilDect-YOLO model in terms of both detection
accuracy and speed. For different combinations of backbone-neck configurations, detection
accuracy in terms of parameters AP, AP
50
, AP
75
, AP
S
, AP
M
, and AP
L
as well as detection
speed (in FPS) has been reported in Table. 2. For the comparison, we select Mish as the
activation function. From the Table. 2, one can see that DenseNet blocks in CSPDarknet53
(i.e., D-CSPDarknet-53) improve the accuracy of the detection model compared to the original
17
Table 2: Performance of various residual and dense block combinations in WilDect-YOLO
architecture for anchors size of 416 ×416.
Backbone
+ add-in
Neck
+add-in
AP AP50 AP75 APSAPMAPLFPS
CSPDarknet53 PANet 76.8 93.6 92.5 80.9 89.2 80.9 59.6
D-CSPDarknet53 PANet 78.4 96.1 92.2 78.3 87.7 81.7 61.1
D-CSPDarknet53+CSPX1-nPANet 79.5 96.1 92.5 77.9 88.2 82.9 60.1
CSPDarknet53 PANet+CSPX2-n77.1 95.6 91.2 74.1 87.9 84.7 63.2
D-CSPDarknet53+CSPX1-nPANet+CSPX2-n81.7 96.9 92.3 87.8 92.5 88.5 59.2
YOLOv4. The performance is further improved by introducing CSPX1-
n
into D-CSPDarknet53.
However, such a configuration results in a slight decrease in detection speed. We observe
that the best performance has been achieved when both CSPX1-
n
and CSPX2-
n
have been
integrated into D-CSPDarknet53 and PANet, respectively. There is a significant improvement
in the accuracy parameter, in particular, AP, AP
S
, and AP
L
increase by 4.9%, 6.9%, and 7.6%,
respectively compared to CSPDarknet53+PANet configuration. Thus, a such configuration
in WilDect-YOLO provides the optimal performance in terms of detection accuracy and
speed for the custom wildlife species data set considered herein. In summary, together with
proper activation function and improved backbone-neck combination provide an efficient
high-performance model for wildlife detection in complex scenarios.
6.2 Comparison with existing state-of-the-art models:
In this section, the detection performance of WilDect-YOLO is compared with some of the
existing state-of-the-art detection models (Zhao et al.,2019b). For the performance comparison,
we consider Faster R-CNN (Ren et al.,2016), Mask R-CNN He et al. (2017), RetinaNet
(Lin et al.,2017b), SSD Liu et al. (2016), YOLOv3 (Redmon and Farhadi,2018), YOLOv4
(Bochkovskiy et al.,2020), and Dense-YOLOv4 (Roy and Bhaduri,2022) that are trained
in the custom wildlife dataset in OpenMMLab object detection toolbox Chen et al. (2019).
Comparison of different performance parameters including P, R, F1-score, mAP, and detection
18
Table 3: Comparison of different performance parameters including P, R, F1, mAP, and
detection speed (in FPS) between WilDect-YOLO and other state-of-the-art models where bold
highlights the best performance values.
Model P (%) R (%) F1-score (%) mAP (%) Dect. time (ms) FPS
Faster R-CNN 71.32 72.39 71.85 73.17 41.12 24.32
RetinaNet 75.11 77.67 76.36 77.11 32.89 30.40
SSD 76.13 80.19 78.10 80.52 28.22 35.43
Mask R-CNN 78.22 83.35 80.70 81.61 50.72 19.72
YOLOv3 83.61 87.47 85.49 86.61 25.11 39.82
YOLOv4 90.19 93.79 91.95 91.29 17.21 58.10
Dense-YOLOv4 93.53 96.42 94.95 93.61 16.77 59.63
WilDect-YOLO 97.18 98.56 97.87 96.89 16.89 59.20
speed obtained from these models have been shown in Table 3. The comparison reveals
that the accuracy of R-CNN, RetinaNet, SSD, and Mask R-CNN is quite inferior compared
to YOLO variants as visually illustrated in the bar-chart plot in Fig. 5. Between YOLOv3
and YOLOv4, YOLOv4 demonstrated better performance with a 6
.
46% increase in F1 and
4
.
68% increase in mAP, respectively. We observe that the performance of Dense-YOLOv4 is
superior to the original YOLOv4 with 3
.
34%, 2
.
63%, 3
.
01%, and 2
.
32% increase in P, R, F1,
and mAP, respectively. However, WilDect-YOLO yields the best performance reaching the
values of 97
.
18%, 98
.
56%, 97
.
87%, and 96
.
89% in P, R, F1, and mAP, respectively as shown
in Fig.5. Moreover, WilDect-YOLO provides a superior real-time detection speed of 59.21
FPS which is 3
.
34% higher than the original YOLOv4 model. In summary, WilDect-YOLO
outshines some of the best detection models in terms of both detection accuracy and speed
suitable for automated high-performance wildlife detection models.
6.3 Overall performance of WilDect-YOLO:
From the previous section, it has been observed that YOLOv4, Dense-YOLOv4, and WilDect-YOLO
provide better performance compared to other state-of-the-art models. Therefore, these three
19
0
10
20
30
40
50
60
70
80
90
100
1 2 3 4 5
F R-CNN
RN
SSD
M R-CNN
Y3
Yv4
D-Yv4
WD-Y
Figure 5: Comparison bar chart of different performance parameters including P, R, F1-score,
mAP, and detection speed (in FPS) between WilDect-YOLO and other state-of-the-art models.
Table 4: Overall performance comparison between original YOLOv4, Dense-YOLOv4, and
WilDect-YOLO.
Detection model IoU F1 mAP Validation loss Detection
time (ms)
Detection
speed (FPS)
YOLOv4 0.810 0.919 0.913 12.07 17.21 58.10
Dense-YOLOv4 0.881 0.949 0.936 5.31 16.77 59.63
WilDect-YOLO 0.917 0.979 0.969 1.88 16.89 59.21
models are closely compared in terms of mAP, F1, IoU, final loss, and average detection
time as shown in Table 4. The proposed WilDect-YOLOv has achieved the highest average
IoU value of 0.917 indicating superior BB accuracy during target detection compared to the
other two models. Similarly, it has also illustrated better detection performance and accuracy
by achieving the highest F1 and mAP values of 97
.
9% and 96
.
9% which are 6
.
1% and 5
.
6%
improvement over the original YOLOv4, respectively. Furthermore, the detection speed of
59.21 FPS obtained from WilDect-YOLO was found to be higher than YOLO and slightly less
than Dense-YOLOv4. Thus, it can provide real-time detection of wildlife species with better
accuracy compared to the other two models. In addition, the comparison of P-R curves between
the three models have been depicted in Fig 6-(a). From the comparison of the P-R curves, one
20
0
0.2
0.4
0.6
0.8
1
1.2
0 0.2 0.4 0.6 0.8 1
YOLOv3
YOLOv4
Improved YOLOv4
0
20
40
60
80
100
120
140
0 1 2 3 4 5 6 7 8 9
yolov3
YOLOv4
Improved YOLOv4
Figure 6: Comparison of (a) P-R curves; (b) loss evolution curves between original YOLOv4,
Dense-YOLOv4, and WilDect-YOLO.
can see that WilDect-YOLO attains a better P value for a particular R. It achieved the highest
area under the P-R curve indicating superior detection performance compared to YOLOv4
and Dense-YOLOv4. Next, we compare the loss evolution curves as shown in Fig 6-(b). In
the initial phase, after exhibiting several cycles of fluctuation, the loss in the WilDect-YOLO
model tends to saturate after approximately 20,000 training steps with a final loss value of
1.88. Whereas, the other two models exhibit higher fluctuation in loss evolution and yield
higher final loss value. Evidently, the proposed WilDect-YOLO is easier to train with faster
convergence characteristics demonstrating its efficacy from the computational point of view.
To further gain insight into the performances of these models, detection result containing
TP, FP, and FN for each class and corresponding P, R, and F-1 values from Dense-YOLOv4
and WilDect-YOLO has been shown in Table 5. WilDect-YOLO has illustrated significant
improvement in P and R values for various classes, in particular, for detecting Galapagoes
Penguine, African Elephant, and Black Rhino classes. WilDect efficiently maximizes the TP
value while simultaneously reducing FP and FN values for all classes. The proposed model
improves 3
.
65% in P and 2
.
14% in R compared to Dense-YOLOv4. From the overall comparison,
we can conclude that WilDect-YOLO demonstrated the best performance in detecting various
endangered wildlife species outperforming both YOLOv4 and Dense-YOLOv4 in terms of
21
Table 5: Comparison of detection results for individual classes between Dense-YOLOv4 and
WilDect-YOLO
Model Class Objects TP FP FN P (%) R (%) F1-score
WilDect-YOLO
All 10070 9694 281 141 97.18 98.56 97.87
Polar Bear 675 656 12 8 98.20 98.79 98.49
Galap. Penguine 1453 1398 33 27 97.69 98.10 97.89
Giant Panda 1211 1201 23 11 98.12 99.09 98.60
Red Panda 789 756 12 09 98.43 98.82 98.63
African Elephant 1987 1878 89 32 95.47 98.32 96.87
Sunda Tiger 987 923 44 19 95.44 97.98 96.70
Black Rhino 1001 981 23 12 97.70 98.79 98.24
Wild Dog 1967 1901 45 23 97.68 98.80 98.24
Dense-YOLO4
All 10070 9291 642 345 93.53 96.42 94.95
Polar Bear 675 621 37 22 94.37 96.58 95.46
Galap. Penguine 1453 1378 39 32 97.24 97.73 97.48
Giant Panda 1211 1118 87 52 92.78 95.56 94.14
Red Panda 789 740 29 26 96.22 96.60 96.41
African Elephant 1987 1801 177 78 91.05 95.84 93.38
Sunda Tiger 987 901 76 32 92.22 96.57 94.34
Black Rhino 1001 921 87 36 91.36 96.23 93.74
Wild Dog 1967 1811 110 67 94.27 96.43 95.34
precision and accuracy values.
6.4 Detection of various animal species:
In this section, we have demonstrated the detection results for eight different classes of
endangered animal species from the proposed WilDect-YOLO and compared them with
Dense-YOLOv4. The visual representations of the detection results have been presented
with confined BBs considering complex backgrounds and challenging environments as shown
in Figs. 7-10. Corresponding detailed detection results consisting of the number of detected
and undetected target classes with average confidence scores have been reported in Table.
6. In Fig. 7, we tested the model for detecting Polar Bears and Galapagos Penguins in a
challenging scenario where the target objects have been placed in a similar textured background.
The proposed model shows its efficacy by preciously detecting the target objects with high
average confidence index values. In separate cases, we have considered detection for Giant
Panda and Red Panda classes where multiple target objects have a significant degree of overlap
22
12
1
2
1111
1
1
2
2
22
2
2
11
1
1
11
22
1
11
1
11
22 2 2
222
2 2
2
22
2
222
2
Figure 7: Detection results for Polar Bear (class-1) and Galapagos Penguin (class-2) from
the proposed WilDectYOLO and Dense-YOLOv4. Detailed detection results with average
confidence indexes have been shown in Table 6.
23
Table 6: Detailed detection results from WilDect-YOLO and Dense-YOLOv4 for different
classes as shown in Figs. 7-10.
Species Figs. No Model Detc. Undetc. Avg. confidence Score
Polar Bear 7(a)-(c) WilDect-YOLO 10 0 0.96
Polar Bear 7(a-i)-(c-i) Dense-YOLOv4 10 0 0.91
Galap. Penguine 7(d)-(f) WilDect-YOLO 16 0 0.93
Galap. Penguine 7(d-i)-(f-i) Dense-YOLOv4 13 3 0.88
Giant Panda 8(a)-(c) WilDect-YOLO 18 1 0.94
Giant Panda 8(a-i)-(c-i) Dense-YOLOv4 14 5 0.83
Red Panda 8(d)-(f) WilDect-YOLO 7 0 0.98
Red Panda 8(d-i)-(f-i) Dense-YOLOv4 7 0 0.93
African Elephant 9(a)-(c) WilDect-YOLO 14 0 0.92
African Elephant 9(a-i)-(c-i) Dense-YOLOv4 10 4 0.83
Sunda Tiger 9(d)-(f) WilDect-YOLO 8 0 0.99
Sunda Tiger 9(d-i)-(f-i) Dense-YOLOv4 7 1 0.92
Black Rhino 10 (a)-(c) WilDect-YOLO 7 0 0.98
Black Rhino 10 (a-i)-(c-i) Dense-YOLOv4 6 1 0.91
Wild Dog 10 (d)-(f) WilDect-YOLO 16 0 0.91
Wild Dog 10 (d-i)-(f-i) Dense-YOLOv4 11 5 0.77
24
4
33
3
333
3
33
33
3
33
3
33
3
33
33
333
3
3
3
44
4
44
444
4
Figure 8: Detection results for Giant Panda (class-3) and Red Panda (class-4) from the proposed
WilDectYOLO and Dense-YOLOv4. Detailed detection results with average confidence scores
have been shown in Table 6.
25
55
5555
5
55
5
5
555
55
5
5
5
5555
66
6
66
666
6
66
66
Figure 9: Detection results for African Elephant (class-5) and Sunda Tiger (class-6) from
the proposed WilDectYOLO and Dense-YOLOv4. Detailed detection results with average
confidence scores have been shown in Table 6.
between them. From the detection result, one can see that the bounding box prediction from
the proposed WilDect-YOLO is quite accurate in detecting each target object as illustrated
in Fig. 8. In Fig. 9, we have extended the detection for African Elephant and Sunda Tiger
classes where the target class is placed in a complex and challenging background. Detection
results from WilDect-YOLO in terms of boundary box precision are more accurate compared to
Dense-YOLOv4 as shown in Table. 6. To further illustrate the efficacy of the WilDect-YOLO
detection performance, we have considered the detection of Black Rhino and African Wild Dog
cases that have a high degree of occlusion, and dense overlapping between object classes. This
is quite a challenging task to detect target objects individually. In such cases, the detection
results from WilDect-YOLO elucidate superior detection accuracy by preciously detecting each
target class with high confidence index as shown in Figs. 10.
Additionally, for poorly visible multiple target objects due to insufficient lightening conditions,
the proposed localization algorithm performs well without missed detection as demonstrated in
Figs. 7-10. For high-aspect-ratio object detection cases with the presence of irregular shapes and
the similarity of their texture with surrounding environments, the proposed the model yields
26
7
7
7
7
7
88
8
8
888
8
8
7
7
7
7
77888
7
7
88
88
88888
8
Figure 10: Detection results for Black Rhino (class-7) and African wild Dog (class-8) from
the proposed WilDectYOLO and Dense-YOLOv4. Detailed detection results with average
confidence scores have been shown in Table 6.
good performance in such challenging scenarios. The overall detection result illustrates accurate
and robust bounding box prediction from WilDect-YOLO for all target classes compared to
Dense-YOLOv4.
7. Discussion :
The current study proposes an efficient automated detection framework for the endangered
wildlife species which can be deployed for animal surveys in various demographic regions without
human intervention. Thus, it can significantly reduce the cost of operation, manual equipment,
and overcome the difficulties of working in these adverse weather conditions. The current
framework illustrates its superior capability of detecting various endangered animals which
are significantly different in terms of body textures, shapes, sizes, colors, and morphological
characteristics. Furthermore, in the presence of various detection challenges such as visual
similarities, complex backgrounds, a high degree of occultation and overlap, and the low
27
distinguishable interface between species and its surroundings, the proposed model can replace
current state-of-the-art detection models in terms of accuracy and robustness. Additionally,
the current deep learning framework can be extended to UAS imagery to further expand the
capability of detecting various wildlife animals. With improved feature extraction capability and
an efficient localization algorithm, the proposed model can be suitable for detecting small-size
animals from relatively low-resolution images as well as satellite imagery. Although, the present
work focus on endangered animal detection, however, the current framework can be extended to
more generalized automated animal species detection for comprehensive and systematic wildlife
animal surveys. Furthermore, the current work can be integrated with geographic information
systems (GIS) for analyzing the migrations and activities of wild animals. Moreover, one
of the potential applications can be assembling object detection framework with semantic
segmentation methods such as Mask R-CNN (Bharati and Pramanik,2020), U-Net (Esser et al.,
2018) to extract additional physical information such as diseases, body fat, height as well as
various animal activities including eating, running, and resting which can be helpful in better
understanding animal health and habits (Norouzzadeh et al.,2018). Nevertheless, the current
deep-learning model outshines classical automated image analysis and various state-of-the-art
approaches in wildlife animal detection indicating future improvements in performance and
usability for the precise and accurate endangered animal survey which can be applied to
various automated wildlife monitoring (Desgarnier et al.,2022;Hou et al.,2020;Chen et al.,
2020;Mannocci et al.,2021;Arbieu et al.,2021) and different biological conservation purposes
(Stern and Humphries,2022). The current framework can also be extended for various fault
detection/thermal imaging(Glowacz,2021b,a,c), human activity recognition (Xiao et al.,2021b)
etc.
8. Conclusions :
Summarizing, in the present work, we have developed an efficient and robust object localization
algorithm WilDect-YOLO is based on computer vision for accurate classification and localization
of various endangered wildlife species. In the proposed network, we integrate DenseNet blocks
to improve feature critical feature information and two new residual blocks for efficient deep
28
spatial feature extraction. In addition, SPP and improved PANet modules have been employed
to efficiently preserve fine-grain localized information by feature fusion. Evaluated on a
custom-made dataset for endangered wildlife species, it has been found that at a detection rate
of 59.20 FPS, WilDect-YOLO has achieved mAP, F1-score, and precision values of 96
.
89%,
97
.
87%, and 97
.
18%, respectively outperforms existing state-of-the-art wildlife detection models
in terms of both classification accuracy and localized bounding box prediction in detecting
various wildlife spices. Current work effectively addresses the shortcoming of existing deep
learning-based wildlife detection models and constitutes a step toward a fully automated
accurate automated wildlife monitoring system in real-time in-field applications.
Acknowledgements: The support of the Aeronautical Research and Development Board
(Grant No. DARO/08/1051450/M/I) is gratefully acknowledged.
Conflict of interest: The authors declare that they have no known competing financial
interests or personal relationships that could have appeared to influence the work reported in
this paper.
Data availability: The data that support the findings of this study are available upon
reasonable request.
29
References
Aebischer, T., Siguindo, G., Rochat, E., Arandjelovic, M., Heilman, A., Hickisch, R., Vigilant,
L., Joost, S., and Wegmann, D. (2017). First quantitative survey delineates the distribution
of chimpanzees in the eastern central african republic. Biological Conservation, 213:84--94.
AlexeyAB (2021). Pre-trained weights-file.
Arbieu, U., Helsper, K., Dadvar, M., Mueller, T., and Niamir, A. (2021). Natural language
processing as a tool to evaluate emotions in conservation conflicts. Biological Conservation,
256:109030.
Austrheim, G., Speed, J. D., Martinsen, V., Mulder, J., and Mysterud, A. (2014). Experimental
effects of herbivore density on aboveground plant biomass in an alpine grassland ecosystem.
Arctic, Antarctic, and Alpine Research, 46(3):535--541.
Barbedo, J. G. A., Koenigkan, L. V., Santos, T. T., and Santos, P. M. (2019). A study on the
detection of cattle in uav images using deep learning. Sensors, 19(24):5436.
Bharati, P. and Pramanik, A. (2020). Deep learning techniques—r-cnn to mask r-cnn: a survey.
Computational Intelligence in Pattern Recognition, pages 657--668.
Bochkovskiy, A., Wang, C.-Y., and Liao, H.-Y. M. (2020). Yolov4: Optimal speed and accuracy
of object detection.
Bose, R. and Roy, A. (2022). Accurate deep learning sub-grid scale models for large eddy
simulations. Bulletin of the American Physical Society.
Chabot, D., Stapleton, S., and Francis, C. M. (2019). Measuring the spectral signature of
polar bears from a drone to improve their detection from space. Biological Conservation,
237:125--132.
Chabot, D., Stapleton, S., and Francis, C. M. (2022). Using web images to train a deep neural
network to detect sparsely distributed wildlife in large volumes of remotely sensed imagery:
A case study of polar bears on sea ice. Ecological Informatics, page 101547.
30
Chalmers, C., Fergus, P., Curbelo Montanez, C. A., Longmore, S. N., and Wich, S. A.
(2021). Video analysis for the detection of animals using convolutional neural networks and
consumer-grade drones. Journal of Unmanned Vehicle Systems, 9(2):112--127.
Chandio, A., Gui, G., Kumar, T., Ullah, I., Ranjbarzadeh, R., Roy, A. M., Hussain, A., and
Shen, Y. (2022). Precise single-stage detector. arXiv preprint arXiv:2210.04252.
Chauvenet, A. L., Gill, R. M., Smith, G. C., Ward, A. I., and Massei, G. (2017). Quantifying
the bias in density estimated from distance sampling and camera trapping of unmarked
individuals. Ecological Modelling, 350:79--86.
Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J.,
Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J.,
Wang, J., Shi, J., Ouyang, W., Loy, C. C., and Lin, D. (2019). MMDetection: Open mmlab
detection toolbox and benchmark. arXiv preprint arXiv:1906.07155.
Chen, X., Zhao, J., Chen, Y.-h., Zhou, W., and Hughes, A. C. (2020). Automatic standardized
processing and identification of tropical bat calls using deep learning approaches. Biological
Conservation, 241:108269.
Cheng, G. and Han, J. (2016). A survey on object detection in optical remote sensing images.
ISPRS Journal of Photogrammetry and Remote Sensing, 117:11--28.
Choe, D.-G. and Kim, D.-K. (2020). Deep learning-based image data processing and archival
system for object detection of endangered species. Journal of information and communication
convergence engineering, 18(4):267--277.
Crooks, K., Burdett, C., Theobald, D., King, S., Marco, M. D., Rondinini, C., and Boitani, L.
(2017). Quantification of habitat fragmentation reveals extinction risk in terrestrial mammals.
Proceedings of the National Academy of Sciences, 114(29):7635--7640.
Davis, J. and Goadrich, M. (2006). The relationship between precision-recall and roc curves.
In Proceedings of the 23rd international conference on Machine learning, pages 233--240.
31
Delplanque, A., Foucher, S., Lejeune, P., Linchant, J., and Th´eau, J. (2021). Multispecies
detection and identification of african mammals in aerial imagery using convolutional neural
networks. Remote Sensing in Ecology and Conservation.
Desgarnier, L., Mouillot, D., Vigliola, L., Chaumont, M., and Mannocci, L. (2022). Putting eagle
rays on the map by coupling aerial video-surveys and deep learning. Biological Conservation,
267:109494.
Divya Meena, S. and Agilandeeswari, L. (2019). An efficient framework for animal breeds
classification using semi-supervised learning and multi-part convolutional neural network
(mp-cnn). IEEE Access, 7:151783--151802.
Duporge, I., Isupova, O., Reece, S., Macdonald, D. W., and Wang, T. (2021). Using
very-high-resolution satellite imagery and deep learning to detect and count african elephants
in heterogeneous landscapes. Remote sensing in ecology and conservation, 7(3):369--381.
Eikelboom, J. A., Wind, J., van de Ven, E., Kenana, L. M., Schroder, B., de Knegt, H. J., van
Langevelde, F., and Prins, H. H. (2019). Improving the precision and accuracy of animal
population estimates with aerial image object detection. Methods in Ecology and Evolution,
10(11):1875--1887.
Esser, P., Sutter, E., and Ommer, B. (2018). A variational u-net for conditional appearance and
shape generation. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 8857--8866.
Feng, J. and Li, J. (2022). An adaptive embedding network with spatial constraints for the
use of few-shot learning in endangered-animal detection. ISPRS International Journal of
Geo-Information, 11(4):256.
Ferri, C., Hern´andez-Orallo, J., and Modroiu, R. (2009). An experimental comparison of
performance measures for classification. Pattern recognition letters, 30(1):27--38.
Ghiasi, G., Lin, T.-Y., and Le, Q. V. (2018). Dropblock: A regularization method for
convolutional networks. Advances in neural information processing systems, 31.
32
Girshick, R. (2015). Fast r-cnn in proceedings of the ieee international conference on computer
vision (pp. 1440--1448). Piscataway, NJ: IEEE.[Google Scholar].
Glowacz, A. (2021a). Fault diagnosis of electric impact drills using thermal imaging.
Measurement, 171:108815.
Glowacz, A. (2021b). Thermographic fault diagnosis of ventilation in bldc motors. Sensors,
21(21):7245.
Glowacz, A. (2021c). Ventilation diagnosis of angle grinder using thermal imaging. Sensors,
21(8):2853.
Gon¸calves, B. C., Spitzbart, B., and Lynch, H. J. (2020). Sealnet: A fully-automated pack-ice
seal detection pipeline for sub-meter satellite imagery. Remote Sensing of Environment,
239:111617.
Gonzalez, L. F., Montes, G. A., Puig, E., Johnson, S., Mengersen, K., and Gaston, K. J. (2016).
Unmanned aerial vehicles (uavs) and artificial intelligence revolutionizing wildlife monitoring
and conservation. Sensors, 16(1):97.
Guo, X., Shao, Q., Li, Y., Wang, Y., Wang, D., Liu, J., Fan, J., and Yang, F. (2018).
Application of uav remote sensing for a population census of large wild herbivores—taking
the headwater region of the yellow river as an example. Remote Sensing, 10(7):1041.
Han, J., Zhang, D., Cheng, G., Liu, N., and Xu, D. (2018). Advanced deep-learning techniques
for salient and category-specific object detection: a survey. IEEE Signal Processing Magazine,
35(1):84--100.
Harris, G., Thompson, R., Childs, J. L., and Sanderson, J. G. (2010). Automatic storage and
analysis of camera trap data. Bulletin of the Ecological Society of America, 91(3):352--360.
He, K., Gkioxari, G., Doll´ar, P., and Girshick, R. (2017). Mask r-cnn. in proceedings of the
ieee international conference on computer vision.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Spatial pyramid pooling in deep convolutional
networks for visual recognition. IEEE transactions on pattern analysis and machine
intelligence, 37(9):1904--1916.
33
He, Q., Zhao, Q., Liu, N., Chen, P., Zhang, Z., and Hou, R. (2019). Distinguishing individual
red pandas from their faces. In Lin, Z., Wang, L., Yang, J., Shi, G., Tan, T., Zheng, N.,
Chen, X., and Zhang, Y., editors, Pattern Recognition and Computer Vision, pages 714--724,
Cham. Springer International Publishing.
Hou, J., He, Y., Yang, H., Connor, T., Gao, J., Wang, Y., Zeng, Y., Zhang, J., Huang, J.,
Zheng, B., et al. (2020). Identification of animal individuals using deep learning: A case
study of giant panda. Biological Conservation, 242:108414.
Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected
convolutional networks. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 4700--4708.
Ibraheam, M., Li, K. F., Gebali, F., and Sielecki, L. E. (2021). A performance comparison
and enhancement of animal species detection in images with various r-cnn models. AI,
2(4):552--577.
Jamil, S., Abbas, M. S., and Roy, A. M. (2022). Distinguishing malicious drones using vision
transformer. AI, 3(2):260--273.
Jaskólski, M. W. (2021). for human activity in arctic coastal environments--a review of selected
interactions and problems. Miscellanea Geographica, 25(2):127--143.
Kellenberger, B., Marcos, D., Lobry, S., and Tuia, D. (2019). Half a percent of labels is
enough: Efficient animal detection in uav imagery using deep cnns and active learning. IEEE
Transactions on Geoscience and Remote Sensing, 57(12):9524--9533.
Kellenberger, B., Marcos, D., and Tuia, D. (2018). Detecting mammals in uav images: Best
practices to address a substantially imbalanced dataset with deep learning. Remote sensing
of environment, 216:139--153.
Khaemba, W. M. and Stein, A. (2002). Improved sampling of wildlife populations using airborne
surveys. Wildlife research, 29(3):269--275.
Khan, W., Kumar, T., Cheng, Z., Raj, K., Roy, A. M., and Luo, B. (2022a). Sql and
nosql databases software architectures performance analysis and assessments--a systematic
literature review. arXiv preprint arXiv:2209.06977.
Khan, W., Raj, K., Kumar, T., Roy, A. M., and Luo, B. (2022b). Introducing urdu digits
dataset with demonstration of an efficient and robust noisy decoder-based pseudo example
generator. Symmetry, 14(10):1976.
Kim, J. S., Elli, G. V., and Bedny, M. (2019). Knowledge of animal appearance among sighted
and blind adults. Proceedings of the National Academy of Sciences, 116(23):11213--11222.
Kudo, H., Koshino, Y., Eto, A., Ichimura, M., and Kaeriyama, M. (2012). Cost-effective
accurate estimates of adult chum salmon, oncorhynchus keta, abundance in a japanese river
using a radio-controlled helicopter. Fisheries Research, 119:94--98.
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436--444.
Lee, W. Y., Park, M., and Hyun, C.-U. (2019). Detection of two arctic birds in greenland and
an endangered bird in korea using rgb and thermal cameras with an unmanned aerial vehicle
(uav). PLOS ONE, 14(9):1--16.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017a). Focal loss for dense object
detection. In Proceedings of the IEEE international conference on computer vision, pages
2980--2988.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017b). Focal loss for dense object
detection. In Proceedings of the IEEE international conference on computer vision, pages
2980--2988.
Liu, S., Qi, L., Qin, H., Shi, J., and Jia, J. (2018). Path aggregation network for instance
segmentation. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 8759--8768.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C., and Berg, A. (2016). Ssd:
Single shot multibox detector. In European Conference on Computer Vision (ECCV).
Loshchilov, I. and Hutter, F. (2017). Sgdr: Stochastic gradient descent with warm restarts.
Mannocci, L., Baidai, Y., Forget, F., Tolotti, M. T., Dagorn, L., and Capello, M. (2021).
Machine learning to detect bycatch risk: Novel application to echosounder buoys data in
tuna purse seine fisheries. Biological Conservation, 255:109004.
Meena, S. D. and Loganathan, A. (2020). Intelligent animal detection system using sparse multi
discriminative-neural network (smd-nn) to mitigate animal-vehicle collision. Environmental
Science and Pollution Research, 27:39619--39634.
Misra, D. (2020). Mish: A self regularized non-monotonic activation function.
Moreni, M., Theau, J., and Foucher, S. (2021). Train fast while reducing false positives:
Improving animal classification performance using convolutional neural networks. Geomatics,
1(1):34--49.
Naude, J. and Joubert, D. (2019). The aerial elephant dataset: A new public benchmark for
aerial object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition Workshops, pages 48--55.
Norouzzadeh, M. S., Nguyen, A., Kosmala, M., Swanson, A., Palmer, M. S., Packer, C.,
and Clune, J. (2018). Automatically identifying, counting, and describing wild animals in
camera-trap images with deep learning. Proceedings of the National Academy of Sciences,
115(25):E5716--E5725.
O’Brien, T. (2010). Wildlife picture index and biodiversity monitoring: issues and future
directions. Animal Conservation, 13(4):350--352.
Ofli, F., Meier, P., Imran, M., Castillo, C., Tuia, D., Rey, N., Briant, J., Millet, P., Reinhard,
F., Parkan, M., et al. (2016). Combining human computing and machine learning to make
sense of big (aerial) data for disaster response. Big data, 4(1):47--59.
Parham, J., Stewart, C., Crall, J., Rubenstein, D., Holmberg, J., and Berger-Wolf, T. (2018). An
animal detection pipeline for identification. In 2018 IEEE Winter Conference on Applications
of Computer Vision (WACV), pages 1075--1083. IEEE.
Peng, J., Wang, D., Liao, X., Shao, Q., Sun, Z., Yue, H., and Ye, H. (2020). Wild animal
survey using uas imagery and deep learning: modified faster r-cnn for kiang detection in
tibetan plateau. ISPRS Journal of Photogrammetry and Remote Sensing, 169:364--376.
Petso, T., Jamisola, R. S., Mpoeleng, D., and Mmereki, W. (2021). Individual animal and herd
identification using custom yolo v3 and v4 with images taken from a uav camera at different
altitudes. In 2021 IEEE 6th International Conference on Signal and Image Processing
(ICSIP), pages 33--39. IEEE.
Pringle, R. M., Syfert, M., Webb, J. K., and Shine, R. (2009). Quantifying historical changes
in habitat availability for endangered species: use of pixel- and object-based remote sensing.
Journal of Applied Ecology, 46(3):544--553.
Rawat, W. and Wang, Z. (2017). Deep convolutional neural networks for image classification:
A comprehensive review. Neural computation, 29(9):2352--2449.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You only look once: Unified,
real-time object detection. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 779--788.
Redmon, J. and Farhadi, A. (2017). Yolo9000: better, faster, stronger. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 7263--7271.
Redmon, J. and Farhadi, A. (2018). Yolov3: An incremental improvement.
Ren, S., He, K., Girshick, R., and Sun, J. (2016). Faster r-cnn: towards real-time object
detection with region proposal networks. IEEE transactions on pattern analysis and machine
intelligence, 39(6):1137--1149.
Rey, N., Volpi, M., Joost, S., and Tuia, D. (2017). Detecting animals in african savanna with
uavs and the crowds. Remote Sensing of Environment, 200:341--351.
Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., and Savarese, S. (2019). Generalized
intersection over union: A metric and a loss for bounding box regression. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 658--666.
Roy, A. M. (2021). Finite element framework for efficient design of three dimensional
multicomponent composite helicopter rotor blade system. Eng, 2(1):69--79.
Roy, A. M. (2022a). Adaptive transfer learning-based multiscale feature fused deep convolutional
neural network for eeg mi multiclassification in brain--computer interface. Engineering
Applications of Artificial Intelligence, 116:105347.
Roy, A. M. (2022b). An efficient multi-scale CNN model with intrinsic feature integration for
motor imagery EEG subject classification in brain-machine interfaces. Biomedical Signal
Processing and Control, 74:103496.
Roy, A. M. (2022c). A multi-scale fusion cnn model based on adaptive transfer learning for
multi-class mi-classification in bci system. BioRxiv.
Roy, A. M. and Bhaduri, J. (2021). A deep learning enabled multi-class plant disease detection
model based on computer vision. AI, 2(3):413--428.
Roy, A. M. and Bhaduri, J. (2022). Real-time growth stage detection model for high degree
of occultation using densenet-fused YOLOv4. Computers and Electronics in Agriculture,
193:106694.
Roy, A. M., Bose, R., and Bhaduri, J. (2022). A fast accurate fine-grain object detection model
based on YOLOv4 deep neural network. Neural Computing and Applications, pages 1--27.
Ruff, Z. J., Lesmeister, D. B., Appel, C. L., and Sullivan, C. M. (2021). Workflow and
convolutional neural network for automated identification of animal sounds. Ecological
Indicators, 124:107419.
Saxena, A., Gupta, D. K., and Singh, S. (2021). An animal detection and collision avoidance
system using deep learning. In Advances in Communication and Computational Technology,
pages 1069--1084. Springer.
Schindler, F. and Steinhage, V. (2021). Identification of animals and recognition of their actions
in wildlife videos using deep learning techniques. Ecological Informatics, 61:101215.
Singh, A., Pietrasik, M., Natha, G., Ghouaiel, N., Brizel, K., and Ray, N. (2020). Animal
detection in man-made environments. In 2020 IEEE Winter Conference on Applications of
Computer Vision (WACV), pages 1427--1438.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014).
Dropout: a simple way to prevent neural networks from overfitting. The journal of machine
learning research, 15(1):1929--1958.
Stern, E. R. and Humphries, M. M. (2022). Interweaving local, expert, and indigenous knowledge
into quantitative wildlife analyses: A systematic review. Biological Conservation, 266:109444.
Taheri, S. and Toygar, Ö. (2018). Animal classification using facial images with score-level
fusion. IET Computer Vision, 12:679--685(6).
Torney, C. J., Lloyd-Jones, D. J., Chevallier, M., Moyer, D. C., Maliti, H. T., Mwita, M.,
Kohi, E. M., and Hopcraft, G. C. (2019). A comparison of deep learning and citizen science
techniques for counting wildlife in aerial survey images. Methods in Ecology and Evolution,
10(6):779--787.
Tzutalin (2015). Labelimg.
Voulodimos, A., Doulamis, N., Doulamis, A., and Protopapadakis, E. (2018). Deep learning for
computer vision: A brief review. Computational intelligence and neuroscience, 2018.
Wang, D., Shao, Q., and Yue, H. (2019). Surveying wild animals from satellites, manned
aircraft and unmanned aerial systems (uass): A review. Remote Sensing, 11(11):1308.
Xiao, Z., Xu, X., Xing, H., Luo, S., Dai, P., and Zhan, D. (2021a). Rtfn: a robust temporal
feature network for time series classification. Information Sciences, 571:65--86.
Xiao, Z., Xu, X., Xing, H., Song, F., Wang, X., and Zhao, B. (2021b). A federated learning
system with enhanced feature extraction for human activity recognition. Knowledge-Based
Systems, 229:107338.
Xiao, Z., Xu, X., Zhang, H., and Szczerbicki, E. (2021c). A new multi-process collaborative
architecture for time series classification. Knowledge-Based Systems, 220:106934.
Xing, H., Xiao, Z., Qu, R., Zhu, Z., and Zhao, B. (2022a). An efficient federated distillation
learning system for multitask time series classification. IEEE Transactions on Instrumentation
and Measurement, 71:1--12.
Xing, H., Xiao, Z., Zhan, D., Luo, S., Dai, P., and Li, K. (2022b). Selfmatch: Robust
semisupervised time-series classification with self-distillation. International Journal of
Intelligent Systems.
Yao, Z., Cao, Y., Zheng, S., Huang, G., and Lin, S. (2021). Cross-iteration batch normalization.
Zhao, Z.-Q., Zheng, P., Xu, S.-t., and Wu, X. (2019a). Object detection with deep learning: A
review. IEEE transactions on neural networks and learning systems, 30(11):3212--3232.
Zhao, Z.-Q., Zheng, P., Xu, S.-t., and Wu, X. (2019b). Object detection with deep learning: A
review. IEEE transactions on neural networks and learning systems, 30(11):3212--3232.
Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., and Ren, D. (2020). Distance-iou loss: Faster
and better learning for bounding box regression. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 34, pages 12993--13000.
Zhu, X. X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., and Fraundorfer, F. (2017). Deep
learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience
and Remote Sensing Magazine, 5(4):8--36.
Zotin, A. G. and Proskurin, A. V. (2019). Animal detection using a series of images under
complex shooting conditions. The International Archives of the Photogrammetry, Remote
Sensing and Spatial Information Sciences, XLII-2/W12:249--257.