All content in this area was uploaded by Arunabha M Roy on Feb 03, 2023
A computer vision-based object localization model for endangered wildlife
detection
Arunabha M. Roy∗1, Jayabrata Bhaduri2, Teerath Kumar3, and Kislay Raj3
1Aerospace Engineering Department, University of Michigan, Ann Arbor, MI
48109, USA
2Capacloud AI, Deep Learning & Data Science Division, Kolkata, WB 711103,
India.
3School of Computing, Dublin City University, Dublin 9, Ireland
Abstract
Objective. Climatic instability, various ecological disturbances, and human actions threaten
the existence of various endangered wildlife species. Therefore, an up-to-date, accurate, and
detailed detection process plays an important role in preventing biodiversity loss, supporting conservation,
and ecosystem management. Current state-of-the-art wildlife detection models, however, often
lack superior feature extraction capability in complex environments, limiting the development
of accurate and reliable detection models. Method. To this end, we present WilDect-YOLO, a
deep learning (DL)-based automated high-performance detection model for real-time endangered
wildlife detection. In the model, we introduce a residual block in the CSPDarknet53 backbone
for strong and discriminative deep spatial feature extraction and integrate DenseNet blocks to
better preserve critical feature information. To enhance receptive field representation,
preserve fine-grain localized information, and improve feature fusion, a Spatial Pyramid Pooling
∗Corresponding author, 4/09/2022
(SPP) block and a modified Path Aggregation Network (PANet) have been implemented, which results
in superior detection under various challenging environments. Results. Evaluated on a custom
endangered wildlife dataset with high variability and complex backgrounds, WilDect-YOLO obtains
a mean average precision (mAP) of 96.89%, an F1-score of 97.87%, and a precision of 97.18%
at a detection rate of 59.20 FPS, outperforming current
state-of-the-art models. Significance. The present research provides an effective and efficient
detection framework addressing the shortcomings of existing DL-based wildlife detection models
by providing highly accurate species-level localized bounding box prediction. The current work
constitutes a step towards a non-invasive, fully automated animal observation system for
real-time in-field applications.
Keywords: Endangered wildlife detection; You Only Look Once (YOLOv4) algorithm; Object
Detection (OD); Computer vision; Deep Learning (DL); Wildlife Preservation
1. Introduction :
In recent years, automated wildlife detection has played a critical role in wildlife surveys (Peng
et al., 2020; Chalmers et al., 2021; Delplanque et al., 2021), conservation (Khaemba and Stein,
2002; O’Brien, 2010), and ecosystem management (Austrheim et al., 2014; Harris et al., 2010)
to tackle the accelerating worldwide biodiversity crisis. Up-to-date, detailed, and accurate wildlife
data can be beneficial in preventing biodiversity losses, ecosystem damage, and poaching
(Norouzzadeh et al., 2018; Petso et al., 2021). Traditional wildlife survey techniques
mainly include distance sampling (Aebischer et al., 2017), camera trapping (Chauvenet et al.,
2017), and satellite monitoring (Chauvenet et al., 2017). However, such techniques
suffer from low efficiency, high cost, the need for qualified personnel, and
individual observer bias (Guo et al., 2018). Similarly, wild animal surveys with aerial image object
detection generally suffer from low accuracy due to complex backgrounds and disturbances
among wild animals (Eikelboom et al., 2019). Moreover, satellite-based monitoring methods
require very-high-resolution satellite imagery, which restricts them to relatively large animals
(Wang et al., 2019).
To circumvent such issues, various automatic and semi-automatic detection algorithms for
wildlife animals have been adopted, in particular, from unmanned aircraft systems (UASs)
imagery (Gonzalez et al.,2016;Ofli et al.,2016). Additionally, pixel-based classification methods
that include threshold setting, supervised, and unsupervised classification have been popular
methods for detecting animals in remote sensing images (Pringle et al.,2009;Kudo et al.,2012).
However, these methods are not adequate for detecting targets with similar gray-scale values
with the complex background (Wang et al.,2019). To detect targets in complex environments,
various machine learning (ML) methods have been employed to localize objects combining
rotation-invariant object descriptors for automated wildlife detection (Cheng and Han,2016).
Although traditional ML yields encouraging results in relatively simple scenarios, it is
neither adequate nor robust for detecting complicated animal features such as
structure, texture, and morphology (Rey et al., 2017; Peng et al., 2020).
More recently, driven by big-data methods (Khan et al.,2022a), deep learning (DL)
characterized by multilayer neural networks (NN) (LeCun et al.,2015) has shown remarkable
breakthroughs in pattern recognition for various fields including image classification (Rawat
and Wang,2017;Jamil et al.,2022;Khan et al.,2022b), computer vision (Voulodimos et al.,
2018;Chandio et al.,2022), object detection (Zhao et al.,2019a;Roy and Bhaduri,2021;Roy
et al.,2022;Roy and Bhaduri,2022), time-series classification (Xiao et al.,2021a,c;Xing et al.,
2022a,b), brain-computer interface (Roy,2022b,a,c), and across diverse scientific disciplines
(Zhu et al.,2017;Roy,2021;Bose and Roy,2022). Particularly in object localization, DL
methods have demonstrated superior accuracy (Han et al.,2018) that can be categorized into
two classes: two-stage and one-stage detector (Lin et al.,2017a). Two-stage detectors including
Region Convolution Neural Network (RCNN) (Girshick,2015), faster-RCNN (Ren et al.,2016),
mask-RCNN (He et al.,2017) etc have shown a significant improvement in accuracy in object
localization. In recent times, You Only Look Once (YOLO) variants (Redmon et al.,2016;
Redmon and Farhadi,2017,2018;Bochkovskiy et al.,2020) have been proposed that unify
target classification and localization leading to significant improvement in the detection speed
(Roy et al.,2022;Roy and Bhaduri,2022,2021). Therefore, driven by advances in computer
vision technologies, wildlife detection is rapidly transforming into a data-rich discipline and has
been applied in the automated detection of a variety of wildlife species (Eikelboom et al.,2019;
Gonçalves et al., 2020; Duporge et al., 2021). Along similar lines, various DL methodologies
such as convolutional neural network (CNN) (Kellenberger et al.,2018), RetinaNet (Eikelboom
et al.,2019), ResNet-50 (Chabot et al.,2022), YOLOv3 (Torney et al.,2019), Faster R-CNN
(Peng et al.,2020), Libra-RCNN (Delplanque et al.,2021) etc have demonstrated high precision
in object localization and can be deployed as a reliable and predictable model for automated
wildlife detection.
Motivations : The main motivation of the present study is to design an efficient and robust
computer vision-based algorithm for the accurate classification and localization of endangered
wildlife species. Climatic instability and various human activities such as thawing, hunting,
oil drilling, etc threaten the existence of various endangered animals and create damage to
ecosystems (Jaskólski, 2021). Species that inhabit such ecosystems are highly specialized to
live in adverse weather conditions, which is why such changes affect them severely (Crooks
et al., 2017). Thus, it is crucial to build an accurate automated endangered wildlife detection
model to conserve and protect the species and the ecosystem. Although several
state-of-the-art works exist for wildlife detection (Barbedo et al., 2019; Naude and Joubert, 2019;
Peng et al., 2020; Moreni et al., 2021), including multi-species animal detection (Eikelboom et al.,
2019; Delplanque et al., 2021), they often suffer from low accuracy, missed detections,
and relatively large computational overhead. Additionally, to the best of the authors'
knowledge, there is no systematic study that addresses the challenge of detecting and accurately
localizing multiple endangered wildlife species, which is worthy of further investigation. To this end,
the current work aims to develop an efficient and robust endangered wildlife classification and
accurate object localization model that is also economical in terms of training time and
computational cost, which is currently missing in recent state-of-the-art models for endangered
wildlife detection.
Challenges : Despite illustrating outstanding performance in detecting wildlife species, current
state-of-the-art DL algorithms are still not suitable due to their insufficient fine-grain feature
extraction capability, leading to missed detections and false object predictions for endangered
species which possess unique body textures, shapes, sizes, and colors (Kim et al., 2019). Between
various species, accurate detection and localization can be challenging due to significant
variability of lighting conditions, low visibility, a high degree of occlusion and overlap,
the coexistence of multi-object classes with various aspect ratios, and other morphological
characteristics (Chabot et al., 2019). Additionally, visual similarities, complex backgrounds,
a poorly distinguishable boundary between species and their surroundings, and various
other critical factors pose additional challenges for state-of-the-art wildlife
detection models (Feng and Li, 2022).
To address the aforementioned shortcomings, in the current study, we present WilDect-YOLO,
based on an improved version of the state-of-the-art YOLOv4 detection model, for accurate real-time
endangered wildlife detection. In WilDect-YOLO, we integrate DenseNet blocks to better
preserve and reuse critical feature information. In addition, two residual blocks have been
carefully designed in the CSPDarknet53 backbone for strong and discriminative deep spatial
feature extraction. Furthermore, Spatial Pyramid Pooling (SPP) has been tightly attached
to the backbone to enhance the representation of receptive fields. We have also utilized
a modified Path Aggregation Network (PANet) to efficiently preserve fine-grain localized
information by feature fusion. Additionally, we performed an extensive ablation study of the
backbone-neck architecture to optimize both detection accuracy and detection speed. The
proposed WilDect-YOLO has been employed to detect eight distinct endangered wildlife
species and provides superior and accurate detection under various complex and challenging
environments. WilDect-YOLO effectively addresses the shortcomings of existing DL-based
wildlife detection models and shows strong potential for real-time in-field applications.
In short, the current work constitutes a step toward a non-invasive, fully automated, efficient animal
observation system.
2. Related Works :
In the present section, some recent and relevant works are highlighted. More recently, a
two-channeled perceiving residual pyramid network (Ruff et al., 2021) has been proposed based
on audio signals that delivers superior detection accuracy. Furthermore, different techniques
such as a segmentation-based YOLO model (Parham et al., 2018), a fast-depth CNN-based
detection model from highly cluttered camera images (Singh et al., 2020), a sparse multi
discriminative-neural network (SMD-NN) (Meena and Loganathan, 2020), a fast image-enhancement
algorithm based on Multi-Scale Retinex (MSR) (Zotin and Proskurin, 2019), a CNN-based
model for facial detection (Taheri and Önsen Toygar, 2018), a semi-supervised learning-based
Multi-part CNN (MP-CNN) (Divya Meena and Agilandeeswari, 2019), and a CNN with k-Nearest
Neighbor (kNN) have been utilized for wildlife detection and provide state-of-the-art performance.
In terms of endangered animal detection, only a handful of works have been
geared toward addressing such an important issue. Notably, a DL-based model for classifying
red pandas (He et al., 2019) and animal action recognition based on wildlife videos (Schindler and
Steinhage, 2021) are some of the representative works in recent endeavors. Additionally, RGB
and thermal image-based Arctic bird detection using drones has been developed in (Lee et al.,
2019). After reviewing the aforementioned methods, which are geared towards endangered
wildlife detection, the current work aims to develop an efficient and robust endangered wildlife
classification and accurate object localization model that is also economical in terms of
training time and computational cost, which is currently lacking in recent state-of-the-art
endeavors.
3. Endangered wildlife species dataset :
Since there is no publicly available endangered wildlife dataset, in the present work, we
have collected high-resolution web-harvested images of different endangered species
under various complex backgrounds. The dataset used for the experiments comprises
eight classes: Polar Bear (Ursus maritimus), Galápagos Penguin (Spheniscus mendiculus),
Giant Panda (Ailuropoda melanoleuca), Red Panda (Ailurus fulgens), African forest elephant
(Loxodonta cyclotis), Sunda Tiger (Panthera tigris sondaica), Black Rhino (Diceros bicornis),
and African wild Dog (Lycaon pictus). Fig. 1 shows some representative images from
the custom dataset for the eight classes considered herein. It is noteworthy that
categories including Galápagos Penguin, Red Panda, African forest elephant, Sunda Tiger,
Black Rhino, and African wild Dog have been declared critically endangered species. The
dataset contains a total of 1,600 images, with 200 images for each class.
Figure 1: Representative sample images from the endangered wildlife dataset, which consists of
eight classes: (a) Polar Bear; (b) Galápagos Penguin; (c) Giant Panda; (d) Red Panda; (e)
African forest elephant; (f) Sunda Tiger; (g) Black Rhino; and (h) African wild Dog
For variability and challenge, we have included images that feature
limited and/or full illumination, low visibility, a high degree of occlusion, multiple objects
with overlap, complex backgrounds, textural similarity between the object and the background,
and noisy environments. Additionally, the images of the dataset vary in scale,
orientation, and resolution.
4. Proposed Methodology for object localization:
In object detection, the target object classification and localization are performed simultaneously
where the target class has been categorized and separated from the background by drawing
bounding boxes (BBs) on input images containing the entire object. This can be particularly
useful for counting endangered species for accurate surveying. To this end, the main goal of the
current work is to develop an accurate and robust endangered wildlife localization model. In
this regard, different variants of YOLO (Redmon et al.,2016;Redmon and Farhadi,2017,2018;
Figure 2: Schematic of (a) YOLO object localization process for endangered wildlife detection;
(b) offset regression process for target BBs prediction during CIoU loss.
Bochkovskiy et al.,2020) are some of the best high-precision one-stage object detection models
that consist of the following parts: a backbone for semantic deep feature extraction, followed by
the neck for hierarchical feature fusion, and finally detection head for object classification and
localization. The overall schematic of the YOLO object localization process has been depicted
in Fig. 2, where the YOLO algorithm transforms the object detection task into a regression
problem by generating BB coordinates and probabilities for each class. During the process,
the input image is uniformly divided into $N \times N$ grids, in which $B$ predictive BBs
are generated. Subsequently, a confidence score is assigned if the target object
falls inside that particular grid. The model detects the target object of a particular class when the
center of the ground truth lies inside a specified grid. During detection, each grid predicts
$N_B$ BBs with the confidence value $\Theta_B$ as:
$$\Theta_B = \Pr(\mathrm{obj}) \times \mathrm{IoU}^{t}_{p}, \qquad \Pr(\mathrm{obj}) \in \{0, 1\} \tag{1}$$
where $\Pr(\mathrm{obj})$ indicates whether a target is present, i.e., $\Pr(\mathrm{obj}) = 1$ indicates that the target
class falls inside the grid; otherwise, $\Pr(\mathrm{obj}) = 0$. The degree of overlap between the ground truth
and the predicted BB is described by the scale-invariant evaluation metric intersection
over union (IoU), which can be expressed as
$$\mathrm{IoU} = \frac{B_p \cap B_t}{B_p \cup B_t} \tag{2}$$
where $B_t$ and $B_p$
are the ground truth and predicted BBs, respectively. However, to further
improve BB regression and mitigate gradient disappearance, generalized IoU (GIoU) (Rezatofighi et al.,
2019) and distance-IoU (DIoU) (Zheng et al., 2020) have been introduced considering aspect ratios
and orientation of the overlapping BBs. More recently, complete IoU (CIoU) (Zheng et al.,
2020) has been proposed for improved accuracy and faster convergence speed in BB prediction,
which can be expressed as
$$L_{CIoU} = 1 + \beta\xi + \frac{\alpha^2(b_p, b_t)}{\eta^2} - \mathrm{IoU} \tag{3}$$
$$\xi = \frac{4}{\pi^2}\left(\tan^{-1}\frac{w_t}{h_t} - \tan^{-1}\frac{w_p}{h_p}\right)^2; \qquad \beta = \frac{\xi}{(1 - \mathrm{IoU}) + \xi'} \tag{4}$$
where $b_t$ and $b_p$ denote the centroids of $B_t$ and $B_p$, respectively, with $\alpha(b_p, b_t)$ the distance
between them; $\xi$ and $\beta$ are the consistency and trade-off parameters, respectively. As shown in
Fig. 2-(b), $\eta$ is the diagonal length of the smallest box enclosing $B_p \cup B_t$; $w_t$, $w_p$ are the widths
and $h_t$, $h_p$ are the heights of $B_t$ and $B_p$, respectively.
With increasing $w_p/h_p$, we get $\xi \to 0$ from Eq. 4. Therefore, to optimize the influence of $\xi$ on
the CIoU, $w_p/h_p$ can be properly chosen for the YOLO model. Finally, the best BB prediction
can be obtained from the non-maximum suppression (NMS) (Ren et al., 2016) algorithm from
multiple scales.
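As an illustrative sketch (not the authors' implementation), Eqs. 2-4 can be evaluated for axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates. The corner format, the function names, and the small `eps` guard are assumptions made here. In addition, ξ′ in Eq. 4 is treated as ξ, following the original CIoU formulation of Zheng et al. (2020), and the centroid distance is taken to be Euclidean.

```python
import math


def iou(box_p, box_t):
    """IoU (Eq. 2) of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_p[0], box_t[0]), max(box_p[1], box_t[1])
    ix2, iy2 = min(box_p[2], box_t[2]), min(box_p[3], box_t[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_t = (box_t[2] - box_t[0]) * (box_t[3] - box_t[1])
    union = area_p + area_t - inter
    return inter / union if union > 0 else 0.0


def ciou_loss(box_p, box_t, eps=1e-9):
    """CIoU loss (Eqs. 3-4); assumes boxes with positive width and height."""
    overlap = iou(box_p, box_t)
    # Squared Euclidean distance between the centroids b_p and b_t.
    cxp, cyp = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    cxt, cyt = (box_t[0] + box_t[2]) / 2, (box_t[1] + box_t[3]) / 2
    dist2 = (cxp - cxt) ** 2 + (cyp - cyt) ** 2
    # Squared diagonal (eta^2) of the smallest box enclosing both boxes.
    ew = max(box_p[2], box_t[2]) - min(box_p[0], box_t[0])
    eh = max(box_p[3], box_t[3]) - min(box_p[1], box_t[1])
    eta2 = ew ** 2 + eh ** 2
    # Aspect-ratio consistency term xi and trade-off weight beta (Eq. 4).
    wp, hp = box_p[2] - box_p[0], box_p[3] - box_p[1]
    wt, ht = box_t[2] - box_t[0], box_t[3] - box_t[1]
    xi = 4 / math.pi ** 2 * (math.atan(wt / ht) - math.atan(wp / hp)) ** 2
    beta = xi / ((1 - overlap) + xi + eps)  # assumption: xi' is taken as xi
    return 1 - overlap + dist2 / (eta2 + eps) + beta * xi
```

For two identical boxes the loss is zero, while for disjoint boxes the centroid-distance term keeps the loss above 1, so a gradient signal remains even when the IoU itself is zero.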
4.1 WilDect-YOLO architecture:
In recent endeavors, various attempts have been made at computer vision-based object detection
algorithms for accurate wildlife detection and survey utilizing deep CNN (Kellenberger et al.,
2019), R-CNN (Ibraheam et al., 2021), Faster R-CNN (Peng et al., 2020), single shot multi-box
detector (SSD) (Saxena et al., 2021), and YOLO (Choe and Kim, 2020). Although the
aforementioned techniques have demonstrated outstanding performance, the
endangered wildlife detection task, specifically in Polar and African regions, faces several
specific challenges, in particular, significant variability of lighting conditions, low visibility,
a high degree of occlusion and overlap, the coexistence of multiple target classes with various
aspect ratios, visual similarities, complex backgrounds, and a poorly distinguishable boundary
between species and their surroundings. Such challenging conditions lead to false object predictions
and a large number of missed detections from the original YOLOv4 (Bochkovskiy et al., 2020)
due to its insufficient fine-grain feature extraction capabilities.
To resolve the existing issues, in the current work, we propose a novel object localization
algorithm WilDect-YOLO based on a state-of-the-art YOLOv4 network, specially designed
for endangered wildlife detection, to enhance feature extraction, preserve fine-grain localized
information and improve feature fusion that provides superior detection under various challenging
environments. The model has been optimized to achieve better efficiency and accuracy of BB
prediction based on the characteristics and complexities of the endangered wildlife dataset
considered herein. The overall network of the object localization model is shown in Fig. 3.
To improve performance in terms of classification accuracy and object localization, we
perform extensive experiments, and various modifications are proposed which are detailed in
the subsequent sections.
Figure 3: Schematic of the proposed WilDect-YOLO consisting of the improved Dense-CSPDarknet53
with residual block CSPX1-n and SPP in the backbone, a modified PANet in the neck part, and the
regular YOLO head. (Panel labels: input 416×416×3; detection outputs 13×13×24, 26×26×24, and
52×52×24; SPP max-pooling kernels 5, 9, and 13; class, CIoU, and confidence losses.)
4.2 Improvement of discriminative feature extraction:
In the present study, we have introduced a residual block CSPX1-n, where n represents the
residual weighting operations, to improve detection speed and performance. We integrate
CSPX1-n modules in the CSPDarknet53 backbone, replacing the original CSP8 and CSP4
residual blocks, to extract fine-grained rich semantic information as shown in Fig. 3. In the
CSPX1-n block, we divide the input features into two parts. In the first part, a (3 × 3) convolution
is performed, followed by an additional (3 × 3) convolution to maintain the number of feature
maps after entering the next residual unit as shown in Fig. 4-(a). To further improve
feature extraction, we perform a 3 × 3 convolution at the end. The second part acts as
a residual edge for the convolution. These two parts are concatenated at the end to
improve the semantic feature information. Implementation of the CSPX1-n modules in the
improved CSPDarknet53 helps to learn more expressive features that demonstrate significant
improvement of detection accuracy for the custom wildlife datasets used herein.
4.3 Preserving critical feature information:
To preserve critical feature maps and efficiently reuse the discriminative feature information,
we have fused DenseNet (Huang et al., 2017) into the original CSPDarknet53. In DenseNet,
each layer is connected to the other layers in a feed-forward mode, where the n-th layer can
receive the important feature information $X_n$ from all the previous layers $X_0, X_1, \ldots, X_{n-1}$
as $X_n = H_n[X_0, X_1, \ldots, X_{n-1}]$, where $H_n$ is the feature map function for the n-th layer. The schematic
of the DenseNet block network structures is shown in Fig. 4-(b, c). As shown in Fig. 3,
we have introduced two DenseNet blocks: the first block (Dense B-1) has been attached before
cross-stage partial block CSPX1-4, whereas the second block (Dense B-2) has been placed
before CSPX1-2 in the proposed WilDect-YOLO network, which results in enhanced feature
propagation. It has been found that DenseNet significantly improves the feature transfer and
mitigates over-fitting in the proposed detection network. Additionally, by reducing redundant
feature operations, such an implementation improves the computational speed.
Figure 4: Schematic of (a) CSPX1-n residual block; (b) dense block (DB)-1; (c) dense block
(DB)-2; (d) CSPX2-n residual block architecture used in the WilDect-YOLO detection model.
(Panel labels: DB-1 takes a 26×26×256 input through 26×26×320, 26×26×384, and 26×26×448 to a
26×26×512 output; DB-2 takes a 13×13×512 input through 13×13×640, 13×13×768, and 13×13×896 to
a 13×13×1024 output.)
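The dense connectivity $X_n = H_n[X_0, X_1, \ldots, X_{n-1}]$ can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' network: `make_layer` is a hypothetical stand-in for $H_n$ (a real layer would apply batch normalization, an activation, and a convolution), and the growth rates of 64 and 128 channels per layer are inferred from the channel counts labelled in Fig. 4 (256 to 512 for DB-1, 512 to 1024 for DB-2).

```python
def make_layer(growth):
    """Hypothetical stand-in for H_n: maps any input to `growth` new channels."""
    def h(channels):
        mean = sum(channels) / len(channels)
        return [mean] * growth
    return h


def dense_block(x0, layer_fns):
    """Apply X_n = H_n[X_0, ..., X_{n-1}] and return the final concatenation."""
    features = [x0]
    for h in layer_fns:
        # Each layer sees the channel-wise concatenation of all earlier outputs.
        concat = [c for f in features for c in f]
        features.append(h(concat))
    return [c for f in features for c in f]


def channel_trace(in_channels, growth, n_layers):
    """Channel count after each layer of a dense block."""
    return [in_channels + k * growth for k in range(n_layers + 1)]


# Channel counts matching the labels in Fig. 4 (growth rates are inferred).
db1 = channel_trace(256, 64, 4)    # [256, 320, 384, 448, 512]
db2 = channel_trace(512, 128, 4)   # [512, 640, 768, 896, 1024]
out = dense_block([1.0] * 256, [make_layer(64)] * 4)
```

Because every layer only adds a fixed number of channels and reuses all earlier feature maps, the channel count grows linearly with depth, which is the feature-reuse property the section relies on.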
4.4 Receptive field enhancement:
One of the requirements of a CNN is to have fixed-size input images. However, due to the
different aspect ratios of the images, they are fixed by cropping and warping during the
convolution process, which results in the loss of important features. In this regard, SPP (He et al.,
2015) applies an efficient strategy for detecting target objects at multiple length scales. To
this end, we have added an SPP block integrated with CSPX1-2 of the Dense-CSPDarknet53
backbone to improve receptive field representation and extraction of important contextual
features as shown in Fig. 3. In the proposed model, a modified SPP consisting of sliding
kernels of various sizes (i.e., 5 × 5, 9 × 9, and 13 × 13) with maximum pooling has been prescribed
that effectively increases the receptive field representation of the backbone.
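A minimal single-channel sketch of such an SPP block is given below. The paper does not spell out the details of its modified SPP, so the choices here are assumptions borrowed from the common YOLOv4 convention: stride-1 max pooling, padding that keeps the spatial size unchanged, and channel-wise concatenation of the input with the three pooled maps. The function names are illustrative only.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling whose output has the same height and width as the input."""
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            # Window clipped at the borders (equivalent to padding with -infinity).
            window = [fmap[a][b]
                      for a in range(max(0, i - r), min(h, i + r + 1))
                      for b in range(max(0, j - r), min(w, j + r + 1))]
            row.append(max(window))
        out.append(row)
    return out


def spp(fmap, kernels=(5, 9, 13)):
    """Return the input map plus one pooled map per kernel size."""
    return [fmap] + [max_pool_same(fmap, k) for k in kernels]


# Toy 13 x 13 feature map whose value increases with position.
fmap = [[i * 13 + j for j in range(13)] for i in range(13)]
branches = spp(fmap)
```

With a 13 × 13 input, the 13 × 13 kernel centred on the middle cell covers the whole map, so the largest branch carries global context while the 5 × 5 branch stays local.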
4.5 Preserving fine-grain localized information:
In addition, an improved PANet (Liu et al., 2018) integrated with CSPX2-n has been utilized
as the neck of the detection model as shown in Fig. 3. It can efficiently fuse high- and low-level
features for multi-scale feature pyramid maps, preserving fine-grain localized information.
Additionally, by employing flexible ROI pooling and element-wise max operation, PANet can
efficiently fuse the information from previous feature layers resulting in significant improvement
in the detection accuracy of the model.
Furthermore, the CIoU loss function (Zheng et al., 2020), DropBlock regularization (Ghiasi et al.,
2018), Cross mini-Batch Normalization (Yao et al., 2021), dropout in the feature map (Srivastava
et al., 2014), and a cosine annealing scheduler (Loshchilov and Hutter, 2017) have been employed
to further improve the performance of WilDect-YOLO. We use the original YOLOv3 head in
the final part of the detection network. Utilizing a 416 × 416 × 3 image as the input, the
detection head of WilDect-YOLO can predict BBs at three different scales: (13 × 13 × 24),
(26 × 26 × 24), and (52 × 52 × 24) as shown in Fig. 3. After extensive experiments, we have
found that the Mish (Misra, 2020) activation provides the optimal performance in terms of model
accuracy. Overall, our proposed methodology provides the best results in terms of accuracy
and performance compared to current state-of-the-art models for endangered wildlife detection
(see Section 6.2).
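Two of the ingredients named above have simple closed forms and can be sketched directly. Mish is defined as x · tanh(softplus(x)) (Misra, 2020), and cosine annealing decays the learning rate along half a cosine period (Loshchilov and Hutter, 2017). The single decay cycle without restarts, the zero minimum learning rate, and the function names are assumptions made for this sketch; the initial learning rate of 0.001 and the 85,000 training steps are the values reported in Section 5.1.

```python
import math


def mish(x):
    """Mish activation: x * tanh(softplus(x)), with a numerically stable softplus."""
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return x * math.tanh(softplus)


def cosine_annealing_lr(step, total_steps=85_000, lr_max=0.001, lr_min=0.0):
    """Learning rate after `step` steps of a single cosine decay cycle (assumed)."""
    cos_term = 1 + math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


lr_start = cosine_annealing_lr(0)        # starts at lr_max
lr_mid = cosine_annealing_lr(42_500)     # half of lr_max at the midpoint
lr_end = cosine_annealing_lr(85_000)     # decays to lr_min
```

Mish behaves like the identity for large positive inputs and saturates smoothly towards zero for large negative inputs, which is the property usually credited for its smoother gradients compared with Leaky ReLU.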
5. Training and performance :
5.1 Training procedure :
In the present work, we have performed an extensive study of the
comparative performance of the proposed WilDect-YOLO models for endangered
wildlife classification and object localization. The initial custom endangered wildlife
species dataset of 1,600 images has been expanded tenfold by utilizing various
data augmentation procedures (i.e., color balancing, rotation, blur processing, mirror projection,
brightness transformation) to obtain the final dataset of 16,000 images (2,000 images
per class). From the final dataset, 60%, 20%, and 20% of the images have been randomly
chosen for the training, validation, and test sets, respectively. For the training set, LabelImg
(Tzutalin, 2015) has been used for the annotation of BBs around the target classes. For all
the experiments, we have used a Windows 10 Pro (64-bit) based computational system that
has an Intel Core i5-10210U CPU @ 2.8 GHz × 6, 32 GB DDR4 memory, and an NVIDIA GeForce
RTX 2080 utilizing CUDA 10.2.89 and cuDNN 10.2 v7.6.5 for GPU parallelization. As required
CV libraries, Visual Studio v15.9 (2017) and OpenCV 4.5.1-vc14 have been integrated with
DarkNet. Unless otherwise stated, the batch size has been set to 32 and the total number of
training steps has been kept at 85,000 during training. The initial learning rate has been set to 0.001. The
model has been trained on the training dataset utilizing the available pre-trained weights file (AlexeyAB,
2021). Various training hyperparameters for WilDect-YOLO are detailed in Table 1.
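The dataset arithmetic described above (1,600 images expanded tenfold, then a random 60/20/20 split) can be reproduced with a short sketch. The use of integer image identifiers, the fixed random seed, and the helper name `split_dataset` are assumptions for illustration; the authors' actual splitting script is not described.

```python
import random


def split_dataset(items, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle `items` and split them into training, validation, and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(round(fractions[0] * len(items)))
    n_val = int(round(fractions[1] * len(items)))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


n_original = 1600          # 8 classes x 200 images
augmentation_factor = 10   # tenfold expansion by augmentation
image_ids = range(n_original * augmentation_factor)
train_set, val_set, test_set = split_dataset(image_ids)
```

This yields 9,600 training, 3,200 validation, and 3,200 test images, with every image assigned to exactly one subset.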
Table 1: Hyperparameter values for training the WilDect-YOLO model
Image size             Sub-division  Batch    Channels        Decay
416 × 416 × 3          8             32       6               0.005
Initial learning rate  Momentum      Classes  Training steps  Filters
0.001                  0.9           8        85,000          36
5.2 Performance metrics:
In the present work, the performance of the object detection models has been evaluated
by common standard measures (Ferri et al., 2009) including average precision (AP), precision
(P), recall (R), IoU, F1-score, mean average precision (mAP), etc. The confusion matrix
obtained from the evaluation procedure provides the following interpretations of the test results:
true positive (TP), false positive (FP), false negative (FN), and true negative (TN). During
binary classification, a detected object is defined as TP for IoU ≥ 0.5, whereas it is
classified as FP for IoU < 0.5. Based on the aforementioned interpretations, the metric P of
the classifier can be defined by its ability to distinguish target classes correctly as:
$$P = \frac{TP}{TP + FP} \tag{5}$$
The ratio of correctly predicted target instances to all ground-truth instances is called R of the
classifier, which can be evaluated as:
$$R = \frac{TP}{TP + FN} \tag{6}$$
Higher values of P and R indicate superior detection capability. The F1-score is
the harmonic mean of P and R, given as:
$$F1\text{-score} = \frac{2P \times R}{P + R} \tag{7}$$
A relatively high F1-score represents a robust detection model. The performance metric AP
can be defined as the area under a P-R curve (Davis and Goadrich, 2006) as follows:
$$AP = \int_0^1 P(R)\, dR \tag{8}$$
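Eqs. 5-8 translate directly into code. The sketch below is illustrative rather than the evaluation code used in the paper: the function names are hypothetical, and the integral in Eq. 8 is approximated with the trapezoidal rule over sampled (R, P) points, whereas detection benchmarks often use interpolated variants of the P-R curve.

```python
def precision(tp, fp):
    """Eq. 5: fraction of predicted detections that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Eq. 6: fraction of ground-truth objects that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Eq. 7: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def average_precision(recalls, precisions):
    """Eq. 8: area under the P-R curve, approximated by the trapezoidal rule."""
    points = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area


# The P and R values reported for WilDect-YOLO in Table 3 reproduce its F1-score.
f1_wildect = round(f1_score(97.18, 98.56), 2)
```

Since Eq. 7 is a harmonic mean, the F1-score is pulled towards the smaller of P and R, so a model cannot obtain a high score by excelling at only one of them.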
A higher average AP value indicates better accuracy in predicting various object classes. In
addition, $AP_{50:95}$ denotes AP over IoU = 0.50 : 0.05 : 0.95; $AP_{50}$ and $AP_{75}$ are APs at IoU
thresholds of 50% and 75%, respectively. The AP for detecting small, medium, and large objects
can be measured through $AP_S$, $AP_M$, and $AP_L$, respectively. Finally, mAP can be obtained
from the average of all APs as:
$$mAP = \frac{1}{N_c} \sum_{i=1}^{N_c} AP_i \tag{9}$$
where $N_c$ is the number of object classes.
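The averaging in Eq. 9 and the IoU threshold grid behind $AP_{50:95}$ can be written out as follows. This is a small sketch with hypothetical function names, and the per-class AP values in the usage line are made-up placeholders rather than results from the paper.

```python
def iou_thresholds(start=0.50, stop=0.95, step=0.05):
    """IoU thresholds 0.50 : 0.05 : 0.95 used for the AP_50:95 metric."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(n)]


def mean_average_precision(ap_per_class):
    """Eq. 9: mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


thresholds = iou_thresholds()
# Placeholder AP values for two classes, for illustration only.
example_map = mean_average_precision({"class 1": 0.9, "class 2": 0.7})
```

The grid contains ten thresholds, so $AP_{50:95}$ averages ten AP values per class, and $AP_{50}$ and $AP_{75}$ correspond to its first and sixth entries.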
6. Results:
In this section, the performance and detection accuracy of the proposed WilDect-YOLO
framework are discussed, as evaluated on a custom-made endangered
wildlife dataset consisting of 8 classes. For better clarity in BB representation, the following
BB class identifiers have been associated in the detection results: class 1- Polar Bear; class 2-
Galápagos Penguin; class 3- Giant Panda; class 4- Red Panda; class 5- African forest elephant;
class 6- Sunda Tiger; class 7- Black Rhino; and class 8- African wild Dog. The performance of
the WilDect-YOLO network has been optimized through extensive ablation studies. Finally,
the performance of the proposed model has been studied in detail and compared with several
state-of-the-art object detection models.
6.1 Optimization of network performance:
At first, we conduct extensive experiments to select proper backbone-neck combinations
to optimize the performance of the proposed WilDect-YOLO model in terms of both detection
accuracy and speed. For different combinations of backbone-neck configurations, detection
accuracy in terms of parameters AP, AP
50
, AP
75
, AP
S
, AP
M
, and AP
L
as well as detection
speed (in FPS) has been reported in Table. 2. For the comparison, we select Mish as the
activation function. From the Table. 2, one can see that DenseNet blocks in CSPDarknet53
(i.e., D-CSPDarknet-53) improve the accuracy of the detection model compared to the original
17
Table 2: Performance of various residual and dense block combinations in WilDect-YOLO
architecture for anchors size of 416 ×416.
Backbone
+ add-in
Neck
+add-in
AP AP50 AP75 APSAPMAPLFPS
CSPDarknet53 PANet 76.8 93.6 92.5 80.9 89.2 80.9 59.6
D-CSPDarknet53 PANet 78.4 96.1 92.2 78.3 87.7 81.7 61.1
D-CSPDarknet53+CSPX1-nPANet 79.5 96.1 92.5 77.9 88.2 82.9 60.1
CSPDarknet53 PANet+CSPX2-n77.1 95.6 91.2 74.1 87.9 84.7 63.2
D-CSPDarknet53+CSPX1-nPANet+CSPX2-n81.7 96.9 92.3 87.8 92.5 88.5 59.2
YOLOv4. The performance is further improved by introducing CSPX1-n into D-CSPDarknet53.
However, such a configuration results in a slight decrease in detection speed. We observe
that the best performance is achieved when both CSPX1-n and CSPX2-n are integrated into
D-CSPDarknet53 and PANet, respectively. There is a significant improvement in accuracy: in
particular, AP, AP_S, and AP_L increase by 4.9, 6.9, and 7.6 percentage points, respectively,
compared to the CSPDarknet53+PANet configuration. Thus, such a configuration
in WilDect-YOLO provides the optimal performance in terms of detection accuracy and
speed for the custom wildlife species dataset considered herein. In summary, a proper
activation function together with an improved backbone-neck combination provides an efficient
high-performance model for wildlife detection in complex scenarios.
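As an illustrative check of the gains quoted above, the following minimal Python sketch recomputes the differences between the baseline CSPDarknet53+PANet row and the full configuration row of Table 2. The values are transcribed from Table 2; the dictionary keys and the helper name `metric_gains` are chosen here for illustration only and are not part of the original work.

```python
# Rows transcribed from Table 2 (accuracy metrics in %, speed in FPS).
baseline = {"AP": 76.8, "AP50": 93.6, "AP75": 92.5,
            "AP_S": 80.9, "AP_M": 89.2, "AP_L": 80.9, "FPS": 59.6}
full_config = {"AP": 81.7, "AP50": 96.9, "AP75": 92.3,
               "AP_S": 87.8, "AP_M": 92.5, "AP_L": 88.5, "FPS": 59.2}


def metric_gains(new, ref):
    """Return the absolute difference (new - ref) for every metric in ref."""
    return {name: round(new[name] - ref[name], 1) for name in ref}


gains = metric_gains(full_config, baseline)
for name, delta in gains.items():
    print(f"{name}: {delta:+.1f}")
```

The differences are absolute (percentage points for the AP metrics), which is how the 4.9, 6.9, and 7.6 figures in the text relate to the table entries; the same helper also makes the small speed cost of the full configuration visible.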
6.2 Comparison with existing state-of-the-art models:
In this section, the detection performance of WilDect-YOLO is compared with some of the
existing state-of-the-art detection models (Zhao et al., 2019b). For the performance comparison,
we consider Faster R-CNN (Ren et al., 2016), Mask R-CNN (He et al., 2017), RetinaNet
(Lin et al., 2017b), SSD (Liu et al., 2016), YOLOv3 (Redmon and Farhadi, 2018), YOLOv4
(Bochkovskiy et al., 2020), and Dense-YOLOv4 (Roy and Bhaduri, 2022), all trained
on the custom wildlife dataset in the OpenMMLab object detection toolbox (Chen et al., 2019).
Comparison of different performance parameters including P, R, F1-score, mAP, and detection
Table 3: Comparison of different performance parameters including P, R, F1, mAP, and
detection speed (in FPS) between WilDect-YOLO and other state-of-the-art models where bold
highlights the best performance values.
Model          P (%)   R (%)   F1-score (%)   mAP (%)   Detection time (ms)   FPS
Faster R-CNN   71.32   72.39   71.85          73.17     41.12                 24.32
RetinaNet      75.11   77.67   76.36          77.11     32.89                 30.40
SSD            76.13   80.19   78.10          80.52     28.22                 35.43
Mask R-CNN     78.22   83.35   80.70          81.61     50.72                 19.72
YOLOv3         83.61   87.47   85.49          86.61     25.11                 39.82
YOLOv4         90.19   93.79   91.95          91.29     17.21                 58.10
Dense-YOLOv4   93.53   96.42   94.95          93.61     16.77                 59.63
WilDect-YOLO   97.18   98.56   97.87          96.89     16.89                 59.20
speed obtained from these models has been shown in Table 3. The comparison reveals
that the accuracy of Faster R-CNN, RetinaNet, SSD, and Mask R-CNN is considerably lower than
that of the YOLO variants, as visually illustrated in the bar chart in Fig. 5. Between YOLOv3
and YOLOv4, YOLOv4 demonstrates better performance with a 6.46% increase in F1 and a
4.68% increase in mAP. We observe that the performance of Dense-YOLOv4 is
superior to the original YOLOv4 with 3.34%, 2.63%, 3.01%, and 2.32% increases in P, R, F1,
and mAP, respectively. However, WilDect-YOLO yields the best performance, reaching
values of 97.18%, 98.56%, 97.87%, and 96.89% in P, R, F1, and mAP, respectively, as shown
in Fig. 5. Moreover, WilDect-YOLO provides a real-time detection speed of 59.21
FPS, which is about 1.9% higher than the 58.10 FPS of the original YOLOv4 model (Table 3).
In summary, WilDect-YOLO outperforms some of the best detection models in terms of both
detection accuracy and speed, making it suitable for automated high-performance wildlife
detection.
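The relation between the detection time and FPS columns of Table 3 can be sketched as follows. The sketch assumes that FPS is simply the reciprocal of the per-image detection time; the paper does not state this explicitly, although the tabulated values agree with it to within rounding. The function names `fps_from_latency` and `relative_gain_percent` are illustrative.

```python
# Per-image detection times (ms) transcribed from Table 3.
detection_time_ms = {
    "YOLOv4": 17.21,
    "Dense-YOLOv4": 16.77,
    "WilDect-YOLO": 16.89,
}


def fps_from_latency(latency_ms):
    """Frames per second, assuming FPS = 1000 / per-image latency in ms."""
    return 1000.0 / latency_ms


def relative_gain_percent(new, ref):
    """Relative change of `new` with respect to `ref`, in percent."""
    return (new - ref) / ref * 100.0


fps = {name: fps_from_latency(ms) for name, ms in detection_time_ms.items()}
for name, value in fps.items():
    print(f"{name}: {value:.2f} FPS")

# Relative speed difference between two reported FPS values.
gain = relative_gain_percent(59.21, 58.10)
print(f"WilDect-YOLO vs. YOLOv4: {gain:+.2f}% FPS")
```

Keeping absolute differences (FPS) and relative differences (percent) as separate helpers avoids mixing the two kinds of figures when comparing models.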
6.3 Overall performance of WilDect-YOLO:
From the previous section, it has been observed that YOLOv4, Dense-YOLOv4, and WilDect-YOLO
provide better performance compared to other state-of-the-art models. Therefore, these three
[Bar chart legend: Faster R-CNN, RetinaNet, SSD, Mask R-CNN, YOLOv3, YOLOv4, Dense-YOLOv4, WilDect-YOLO]
Figure 5: Comparison bar chart of different performance parameters including P, R, F1-score,
mAP, and detection speed (in FPS) between WilDect-YOLO and other state-of-the-art models.
Table 4: Overall performance comparison between original YOLOv4, Dense-YOLOv4, and
WilDect-YOLO.
Detection model   IoU     F1      mAP     Validation loss   Detection time (ms)   Detection speed (FPS)
YOLOv4            0.810   0.919   0.913   12.07             17.21                 58.10
Dense-YOLOv4      0.881   0.949   0.936   5.31              16.77                 59.63
WilDect-YOLO      0.917   0.979   0.969   1.88              16.89                 59.21
models are closely compared in terms of mAP, F1, IoU, final loss, and average detection
time as shown in Table 4. The proposed WilDect-YOLO achieves the highest average
IoU value of 0.917, indicating superior bounding box (BB) accuracy during target detection
compared to the other two models. Similarly, it also shows better detection performance and
accuracy by achieving the highest F1 and mAP values of 97.9% and 96.9%, which are 6.1% and 5.6%
improvements over the original YOLOv4, respectively. Furthermore, the detection speed of
59.21 FPS obtained from WilDect-YOLO is higher than that of YOLOv4 and slightly lower
than that of Dense-YOLOv4. Thus, it can provide real-time detection of wildlife species with better
accuracy compared to the other two models. In addition, the comparison of P-R curves between
the three models has been depicted in Fig. 6(a). From the comparison of the P-R curves, one
Figure 6: Comparison of (a) P-R curves; (b) loss evolution curves between original YOLOv4,
Dense-YOLOv4, and WilDect-YOLO.
can see that WilDect-YOLO attains a higher P value for a given R. It achieves the largest
area under the P-R curve, indicating superior detection performance compared to YOLOv4
and Dense-YOLOv4. Next, we compare the loss evolution curves as shown in Fig. 6(b). In
the initial phase, after exhibiting several cycles of fluctuation, the loss in the WilDect-YOLO
model tends to saturate after approximately 20,000 training steps with a final loss value of
1.88. In contrast, the other two models exhibit higher fluctuation in loss evolution and yield
higher final loss values. Evidently, the proposed WilDect-YOLO is easier to train with faster
convergence characteristics, demonstrating its efficacy from the computational point of view.
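The notion of "area under the P-R curve" used in the comparison above can be illustrated with a short sketch. It assumes a simple trapezoidal integration over (R, P) operating points; the exact interpolation scheme used for the reported mAP values is not specified in this section, and the two curves below are hypothetical points invented purely for illustration, not data from Fig. 6.

```python
def pr_curve_area(recall, precision):
    """Trapezoidal area under a precision-recall curve.

    `recall` and `precision` are equal-length sequences of operating points;
    the points are sorted by recall before integrating.
    """
    points = sorted(zip(recall, precision))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


# Hypothetical operating points (NOT taken from the paper).
recall_points = [0.0, 0.5, 1.0]
curve_a = [1.0, 0.90, 0.60]   # a weaker detector
curve_b = [1.0, 0.95, 0.80]   # a stronger detector

area_a = pr_curve_area(recall_points, curve_a)
area_b = pr_curve_area(recall_points, curve_b)
print(f"area A = {area_a:.3f}, area B = {area_b:.3f}")
```

A curve that lies above another at every recall value, as described for WilDect-YOLO, necessarily encloses a larger area under this definition.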
To gain further insight into the performance of these models, detection results containing
TP, FP, and FN for each class and the corresponding P, R, and F1 values from Dense-YOLOv4
and WilDect-YOLO are shown in Table 5. WilDect-YOLO shows significant
improvement in P and R values for various classes, in particular for the Galapagos
Penguin, African Elephant, and Black Rhino classes. WilDect-YOLO increases the TP
value while simultaneously reducing FP and FN values for all classes. Overall, the proposed
model improves P by 3.65% and R by 2.14% compared to Dense-YOLOv4. From the overall comparison,
we can conclude that WilDect-YOLO demonstrates the best performance in detecting various
endangered wildlife species, outperforming both YOLOv4 and Dense-YOLOv4 in terms of
Table 5: Comparison of detection results for individual classes between Dense-YOLOv4 and
WilDect-YOLO.
Model: WilDect-YOLO
Class              Objects   TP     FP    FN    P (%)   R (%)   F1-score (%)
All                10070     9694   281   141   97.18   98.56   97.87
Polar Bear         675       656    12    8     98.20   98.79   98.49
Galap. Penguin     1453      1398   33    27    97.69   98.10   97.89
Giant Panda        1211      1201   23    11    98.12   99.09   98.60
Red Panda          789       756    12    9     98.43   98.82   98.63
African Elephant   1987      1878   89    32    95.47   98.32   96.87
Sunda Tiger        987       923    44    19    95.44   97.98   96.70
Black Rhino        1001      981    23    12    97.70   98.79   98.24
Wild Dog           1967      1901   45    23    97.68   98.80   98.24
Model: Dense-YOLOv4
Class              Objects   TP     FP    FN    P (%)   R (%)   F1-score (%)
All                10070     9291   642   345   93.53   96.42   94.95
Polar Bear         675       621    37    22    94.37   96.58   95.46
Galap. Penguin     1453      1378   39    32    97.24   97.73   97.48
Giant Panda        1211      1118   87    52    92.78   95.56   94.14
Red Panda          789       740    29    26    96.22   96.60   96.41
African Elephant   1987      1801   177   78    91.05   95.84   93.38
Sunda Tiger        987       901    76    32    92.22   96.57   94.34
Black Rhino        1001      921    87    36    91.36   96.23   93.74
Wild Dog           1967      1811   110   67    94.27   96.43   95.34
precision and accuracy values.
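The P, R, and F1 columns of Table 5 follow from the TP, FP, and FN counts through the standard definitions P = TP/(TP+FP), R = TP/(TP+FN), and F1 = 2PR/(P+R). The minimal sketch below applies these definitions to the "All" rows of Table 5; the function name `precision_recall_f1` is illustrative, and small differences in the last digit relative to the table are expected from rounding.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) in percent from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2.0 * precision * recall / (precision + recall)
    return 100.0 * precision, 100.0 * recall, 100.0 * f1


# (TP, FP, FN) for the "All" rows, transcribed from Table 5.
counts = {
    "WilDect-YOLO": (9694, 281, 141),
    "Dense-YOLOv4": (9291, 642, 345),
}

scores = {model: precision_recall_f1(*c) for model, c in counts.items()}
for model, (p, r, f1) in scores.items():
    print(f"{model}: P = {p:.2f}%, R = {r:.2f}%, F1 = {f1:.2f}%")
```

The same function can be applied row by row to the per-class counts to reproduce the remaining entries of the table.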
6.4 Detection of various animal species:
In this section, we demonstrate the detection results for eight different classes of
endangered animal species from the proposed WilDect-YOLO and compare them with
Dense-YOLOv4. The detection results are presented visually with confined BBs for complex
backgrounds and challenging environments, as shown in Figs. 7-10. The corresponding detailed
detection results, consisting of the number of detected and undetected targets with average
confidence scores, are reported in Table 6. In Fig. 7, we test the model on Polar Bears and
Galapagos Penguins in a challenging scenario where the target objects appear against a
similarly textured background. The proposed model shows its efficacy by precisely detecting
the target objects with high average confidence scores. In separate cases, we consider
detection for the Giant Panda and Red Panda classes, where multiple target objects have a
significant degree of overlap
Table 6: Detailed detection results from WilDect-YOLO and Dense-YOLOv4 for different
classes as shown in Figs. 7-10.
Species            Fig. No.         Model          Detected   Undetected   Avg. confidence score
Polar Bear         7(a)-(c)         WilDect-YOLO   10         0            0.96
Polar Bear         7(a-i)-(c-i)     Dense-YOLOv4   10         0            0.91
Galap. Penguin     7(d)-(f)         WilDect-YOLO   16         0            0.93
Galap. Penguin     7(d-i)-(f-i)     Dense-YOLOv4   13         3            0.88
Giant Panda        8(a)-(c)         WilDect-YOLO   18         1            0.94
Giant Panda        8(a-i)-(c-i)     Dense-YOLOv4   14         5            0.83
Red Panda          8(d)-(f)         WilDect-YOLO   7          0            0.98
Red Panda          8(d-i)-(f-i)     Dense-YOLOv4   7          0            0.93
African Elephant   9(a)-(c)         WilDect-YOLO   14         0            0.92
African Elephant   9(a-i)-(c-i)     Dense-YOLOv4   10         4            0.83
Sunda Tiger        9(d)-(f)         WilDect-YOLO   8          0            0.99
Sunda Tiger        9(d-i)-(f-i)     Dense-YOLOv4   7          1            0.92
Black Rhino        10(a)-(c)        WilDect-YOLO   7          0            0.98
Black Rhino        10(a-i)-(c-i)    Dense-YOLOv4   6          1            0.91
Wild Dog           10(d)-(f)        WilDect-YOLO   16         0            0.91
Wild Dog           10(d-i)-(f-i)    Dense-YOLOv4   11         5            0.77
Figure 9: Detection results for African Elephant (class-5) and Sunda Tiger (class-6) from
the proposed WilDect-YOLO and Dense-YOLOv4. Detailed detection results with average
confidence scores have been shown in Table 6.
between them. From the detection results, one can see that the bounding box prediction from
the proposed WilDect-YOLO is quite accurate in detecting each target object, as illustrated
in Fig. 8. In Fig. 9, we extend the detection to the African Elephant and Sunda Tiger
classes, where the targets are placed in complex and challenging backgrounds. Detection
results from WilDect-YOLO are more accurate than those of Dense-YOLOv4 in terms of bounding
box precision, as shown in Table 6. To further illustrate the efficacy of WilDect-YOLO,
we consider the detection of Black Rhino and African Wild Dog cases that have a high degree
of occlusion and dense overlap between objects, which makes it quite challenging to detect
target objects individually. In such cases, WilDect-YOLO demonstrates superior detection
accuracy by precisely detecting each target with a high confidence score, as shown in
Fig. 10.
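To summarize the counts in Table 6 across all eight species, the sketch below totals the detected and undetected targets per model and forms the ratio detected/(detected + undetected). This "detection rate" is an illustrative summary introduced here, not a metric reported by the authors, and it covers only the example images of Figs. 7-10.

```python
# (detected, undetected) counts per species, transcribed from Table 6.
table6 = {
    "Polar Bear":       {"WilDect-YOLO": (10, 0), "Dense-YOLOv4": (10, 0)},
    "Galap. Penguin":   {"WilDect-YOLO": (16, 0), "Dense-YOLOv4": (13, 3)},
    "Giant Panda":      {"WilDect-YOLO": (18, 1), "Dense-YOLOv4": (14, 5)},
    "Red Panda":        {"WilDect-YOLO": (7, 0),  "Dense-YOLOv4": (7, 0)},
    "African Elephant": {"WilDect-YOLO": (14, 0), "Dense-YOLOv4": (10, 4)},
    "Sunda Tiger":      {"WilDect-YOLO": (8, 0),  "Dense-YOLOv4": (7, 1)},
    "Black Rhino":      {"WilDect-YOLO": (7, 0),  "Dense-YOLOv4": (6, 1)},
    "Wild Dog":         {"WilDect-YOLO": (16, 0), "Dense-YOLOv4": (11, 5)},
}


def detection_summary(model):
    """Return (total detected, total undetected, detection rate) for a model."""
    detected = sum(row[model][0] for row in table6.values())
    undetected = sum(row[model][1] for row in table6.values())
    return detected, undetected, detected / (detected + undetected)


for model in ("WilDect-YOLO", "Dense-YOLOv4"):
    det, und, rate = detection_summary(model)
    print(f"{model}: {det} detected, {und} missed, rate = {rate:.3f}")
```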
Additionally, for multiple target objects that are poorly visible due to insufficient lighting
conditions, the proposed localization algorithm performs well without missed detection as
demonstrated in Figs. 7-10. For high-aspect-ratio object detection cases with the presence of
irregular shapes and the similarity of their texture with surrounding environments, the
proposed model yields
Figure 10: Detection results for Black Rhino (class-7) and African Wild Dog (class-8) from
the proposed WilDect-YOLO and Dense-YOLOv4. Detailed detection results with average
confidence scores have been shown in Table 6.
good performance in such challenging scenarios. The overall detection result illustrates accurate
and robust bounding box prediction from WilDect-YOLO for all target classes compared to
Dense-YOLOv4.
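Bounding box accuracy of the kind discussed in this section is commonly quantified by the intersection over union (IoU) between a predicted box and its ground-truth box. The following minimal sketch shows the standard IoU computation for axis-aligned boxes; the (x_min, y_min, x_max, y_max) box format and the function name `box_iou` are assumptions made for illustration and do not describe the authors' implementation.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical boxes for illustration only.
ground_truth = (0.0, 0.0, 2.0, 2.0)
prediction = (1.0, 1.0, 3.0, 3.0)
print(f"IoU = {box_iou(ground_truth, prediction):.3f}")
```

An IoU of 1 corresponds to a prediction that coincides with the ground truth, while heavily occluded or overlapping animals tend to lower the IoU because a box may spill onto a neighboring object.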
7. Discussion:
The current study proposes an efficient automated detection framework for endangered
wildlife species which can be deployed for animal surveys in various geographic regions without
human intervention. Thus, it can significantly reduce the cost of operation and manual
equipment, and overcome the difficulties of working in adverse weather conditions. The
framework demonstrates its capability of detecting various endangered animals that differ
significantly in body texture, shape, size, color, and morphological characteristics.
Furthermore, in the presence of various detection challenges such as visual similarities,
complex backgrounds, a high degree of occlusion and overlap, and poorly distinguishable
boundaries between species and their surroundings, the proposed model can replace
current state-of-the-art detection models in terms of accuracy and robustness. Additionally,
the current deep learning framework can be extended to UAS imagery to further expand the
capability of detecting various wildlife animals. With improved feature extraction capability and
an efficient localization algorithm, the proposed model can be suitable for detecting small-size
animals from relatively low-resolution images as well as satellite imagery. Although the present
work focuses on endangered animal detection, the current framework can be extended to
more generalized automated animal species detection for comprehensive and systematic wildlife
surveys. Furthermore, the current work can be integrated with geographic information
systems (GIS) for analyzing the migrations and activities of wild animals. Moreover, one
potential application is combining the object detection framework with semantic
segmentation methods such as Mask R-CNN (Bharati and Pramanik, 2020) and U-Net (Esser et al.,
2018) to extract additional physical information such as diseases, body fat, and height, as well as
various animal activities including eating, running, and resting, which can be helpful in better
understanding animal health and habits (Norouzzadeh et al., 2018). Nevertheless, the current
deep-learning model outperforms classical automated image analysis and various state-of-the-art
approaches in wildlife detection, indicating future improvements in performance and
usability for precise and accurate endangered animal surveys, which can be applied to
various automated wildlife monitoring (Desgarnier et al., 2022; Hou et al., 2020; Chen et al.,
2020; Mannocci et al., 2021; Arbieu et al., 2021) and different biological conservation purposes
(Stern and Humphries, 2022). The current framework can also be extended to various fault
detection/thermal imaging (Glowacz, 2021b,a,c) and human activity recognition (Xiao et al., 2021b)
tasks.
8. Conclusions:
In summary, in the present work, we have developed an efficient and robust computer
vision-based object localization algorithm, WilDect-YOLO, for accurate classification and
localization of various endangered wildlife species. In the proposed network, we integrate
DenseNet blocks to improve the preservation of critical feature information and two new
residual blocks for efficient deep spatial feature extraction. In addition, SPP and improved
PANet modules have been employed to efficiently preserve fine-grain localized information by
feature fusion. Evaluated on a custom-made dataset of endangered wildlife species,
WilDect-YOLO achieves mAP, F1-score, and precision values of 96.89%, 97.87%, and 97.18%,
respectively, at a detection rate of 59.20 FPS, outperforming existing state-of-the-art wildlife
detection models in terms of both classification accuracy and localized bounding box
prediction in detecting various wildlife species. The current work effectively addresses the
shortcomings of existing deep learning-based wildlife detection models and constitutes a step
toward a fully automated, accurate wildlife monitoring system for real-time in-field
applications.
Acknowledgements: The support of the Aeronautical Research and Development Board
(Grant No. DARO/08/1051450/M/I) is gratefully acknowledged.
Conflict of interest: The authors declare that they have no known competing financial
interests or personal relationships that could have appeared to influence the work reported in
this paper.
Data availability: The data that support the findings of this study are available upon
reasonable request.
References
Aebischer, T., Siguindo, G., Rochat, E., Arandjelovic, M., Heilman, A., Hickisch, R., Vigilant,
L., Joost, S., and Wegmann, D. (2017). First quantitative survey delineates the distribution
of chimpanzees in the eastern central african republic. Biological Conservation, 213:84--94.
AlexeyAB (2021). Pre-trained weights-file.
Arbieu, U., Helsper, K., Dadvar, M., Mueller, T., and Niamir, A. (2021). Natural language
processing as a tool to evaluate emotions in conservation conflicts. Biological Conservation,
256:109030.
Austrheim, G., Speed, J. D., Martinsen, V., Mulder, J., and Mysterud, A. (2014). Experimental
effects of herbivore density on aboveground plant biomass in an alpine grassland ecosystem.
Arctic, Antarctic, and Alpine Research, 46(3):535--541.
Barbedo, J. G. A., Koenigkan, L. V., Santos, T. T., and Santos, P. M. (2019). A study on the
detection of cattle in uav images using deep learning. Sensors, 19(24):5436.
Bharati, P. and Pramanik, A. (2020). Deep learning techniques—r-cnn to mask r-cnn: a survey.
Computational Intelligence in Pattern Recognition, pages 657--668.
Bochkovskiy, A., Wang, C.-Y., and Liao, H.-Y. M. (2020). Yolov4: Optimal speed and accuracy
of object detection.
Bose, R. and Roy, A. (2022). Accurate deep learning sub-grid scale models for large eddy
simulations. Bulletin of the American Physical Society.
Chabot, D., Stapleton, S., and Francis, C. M. (2019). Measuring the spectral signature of
polar bears from a drone to improve their detection from space. Biological Conservation,
237:125--132.
Chabot, D., Stapleton, S., and Francis, C. M. (2022). Using web images to train a deep neural
network to detect sparsely distributed wildlife in large volumes of remotely sensed imagery:
A case study of polar bears on sea ice. Ecological Informatics, page 101547.
Chalmers, C., Fergus, P., Curbelo Montanez, C. A., Longmore, S. N., and Wich, S. A.
(2021). Video analysis for the detection of animals using convolutional neural networks and
consumer-grade drones. Journal of Unmanned Vehicle Systems, 9(2):112--127.
Chandio, A., Gui, G., Kumar, T., Ullah, I., Ranjbarzadeh, R., Roy, A. M., Hussain, A., and
Shen, Y. (2022). Precise single-stage detector. arXiv preprint arXiv:2210.04252.
Chauvenet, A. L., Gill, R. M., Smith, G. C., Ward, A. I., and Massei, G. (2017). Quantifying
the bias in density estimated from distance sampling and camera trapping of unmarked
individuals. Ecological Modelling, 350:79--86.
Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J.,
Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J.,
Wang, J., Shi, J., Ouyang, W., Loy, C. C., and Lin, D. (2019). MMDetection: Open mmlab
detection toolbox and benchmark. arXiv preprint arXiv:1906.07155.
Chen, X., Zhao, J., Chen, Y.-h., Zhou, W., and Hughes, A. C. (2020). Automatic standardized
processing and identification of tropical bat calls using deep learning approaches. Biological
Conservation, 241:108269.
Cheng, G. and Han, J. (2016). A survey on object detection in optical remote sensing images.
ISPRS Journal of Photogrammetry and Remote Sensing, 117:11--28.
Choe, D.-G. and Kim, D.-K. (2020). Deep learning-based image data processing and archival
system for object detection of endangered species. Journal of information and communication
convergence engineering, 18(4):267--277.
Crooks, K., Burdett, C., Theobald, D., King, S., Marco, M. D., Rondinini, C., and Boitani, L.
(2017). Quantification of habitat fragmentation reveals extinction risk in terrestrial mammals.
Proceedings of the National Academy of Sciences, 114(29):7635--7640.
Davis, J. and Goadrich, M. (2006). The relationship between precision-recall and roc curves.
In Proceedings of the 23rd international conference on Machine learning, pages 233--240.
Delplanque, A., Foucher, S., Lejeune, P., Linchant, J., and Théau, J. (2021). Multispecies
detection and identification of african mammals in aerial imagery using convolutional neural
networks. Remote Sensing in Ecology and Conservation.
Desgarnier, L., Mouillot, D., Vigliola, L., Chaumont, M., and Mannocci, L. (2022). Putting eagle
rays on the map by coupling aerial video-surveys and deep learning. Biological Conservation,
267:109494.
Divya Meena, S. and Agilandeeswari, L. (2019). An efficient framework for animal breeds
classification using semi-supervised learning and multi-part convolutional neural network
(mp-cnn). IEEE Access, 7:151783--151802.
Duporge, I., Isupova, O., Reece, S., Macdonald, D. W., and Wang, T. (2021). Using
very-high-resolution satellite imagery and deep learning to detect and count african elephants
in heterogeneous landscapes. Remote sensing in ecology and conservation, 7(3):369--381.
Eikelboom, J. A., Wind, J., van de Ven, E., Kenana, L. M., Schroder, B., de Knegt, H. J., van
Langevelde, F., and Prins, H. H. (2019). Improving the precision and accuracy of animal
population estimates with aerial image object detection. Methods in Ecology and Evolution,
10(11):1875--1887.
Esser, P., Sutter, E., and Ommer, B. (2018). A variational u-net for conditional appearance and
shape generation. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 8857--8866.
Feng, J. and Li, J. (2022). An adaptive embedding network with spatial constraints for the
use of few-shot learning in endangered-animal detection. ISPRS International Journal of
Geo-Information, 11(4):256.
Ferri, C., Hernández-Orallo, J., and Modroiu, R. (2009). An experimental comparison of
performance measures for classification. Pattern recognition letters, 30(1):27--38.
Ghiasi, G., Lin, T.-Y., and Le, Q. V. (2018). Dropblock: A regularization method for
convolutional networks. Advances in neural information processing systems, 31.
Girshick, R. (2015). Fast r-cnn. In Proceedings of the IEEE international conference on
computer vision, pages 1440--1448.
Glowacz, A. (2021a). Fault diagnosis of electric impact drills using thermal imaging.
Measurement, 171:108815.
Glowacz, A. (2021b). Thermographic fault diagnosis of ventilation in bldc motors. Sensors,
21(21):7245.
Glowacz, A. (2021c). Ventilation diagnosis of angle grinder using thermal imaging. Sensors,
21(8):2853.
Gonçalves, B. C., Spitzbart, B., and Lynch, H. J. (2020). Sealnet: A fully-automated pack-ice
seal detection pipeline for sub-meter satellite imagery. Remote Sensing of Environment,
239:111617.
Gonzalez, L. F., Montes, G. A., Puig, E., Johnson, S., Mengersen, K., and Gaston, K. J. (2016).
Unmanned aerial vehicles (uavs) and artificial intelligence revolutionizing wildlife monitoring
and conservation. Sensors, 16(1):97.
Guo, X., Shao, Q., Li, Y., Wang, Y., Wang, D., Liu, J., Fan, J., and Yang, F. (2018).
Application of uav remote sensing for a population census of large wild herbivores—taking
the headwater region of the yellow river as an example. Remote Sensing, 10(7):1041.
Han, J., Zhang, D., Cheng, G., Liu, N., and Xu, D. (2018). Advanced deep-learning techniques
for salient and category-specific object detection: a survey. IEEE Signal Processing Magazine,
35(1):84--100.
Harris, G., Thompson, R., Childs, J. L., and Sanderson, J. G. (2010). Automatic storage and
analysis of camera trap data. Bulletin of the Ecological Society of America, 91(3):352--360.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask r-cnn. In Proceedings of the
IEEE international conference on computer vision.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Spatial pyramid pooling in deep convolutional
networks for visual recognition. IEEE transactions on pattern analysis and machine
intelligence, 37(9):1904--1916.
He, Q., Zhao, Q., Liu, N., Chen, P., Zhang, Z., and Hou, R. (2019). Distinguishing individual
red pandas from their faces. In Lin, Z., Wang, L., Yang, J., Shi, G., Tan, T., Zheng, N.,
Chen, X., and Zhang, Y., editors, Pattern Recognition and Computer Vision, pages 714--724,
Cham. Springer International Publishing.
Hou, J., He, Y., Yang, H., Connor, T., Gao, J., Wang, Y., Zeng, Y., Zhang, J., Huang, J.,
Zheng, B., et al. (2020). Identification of animal individuals using deep learning: A case
study of giant panda. Biological Conservation, 242:108414.
Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected
convolutional networks. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 4700--4708.
Ibraheam, M., Li, K. F., Gebali, F., and Sielecki, L. E. (2021). A performance comparison
and enhancement of animal species detection in images with various r-cnn models. AI,
2(4):552--577.
Jamil, S., Abbas, M. S., and Roy, A. M. (2022). Distinguishing malicious drones using vision
transformer. AI, 3(2):260--273.
Jaskólski, M. W. (2021). for human activity in arctic coastal environments--a review of selected
interactions and problems. Miscellanea Geographica, 25(2):127--143.
Kellenberger, B., Marcos, D., Lobry, S., and Tuia, D. (2019). Half a percent of labels is
enough: Efficient animal detection in uav imagery using deep cnns and active learning. IEEE
Transactions on Geoscience and Remote Sensing, 57(12):9524--9533.
Kellenberger, B., Marcos, D., and Tuia, D. (2018). Detecting mammals in uav images: Best
practices to address a substantially imbalanced dataset with deep learning. Remote sensing
of environment, 216:139--153.
Khaemba, W. M. and Stein, A. (2002). Improved sampling of wildlife populations using airborne
surveys. Wildlife research, 29(3):269--275.
Khan, W., Kumar, T., Cheng, Z., Raj, K., Roy, A. M., and Luo, B. (2022a). Sql and
nosql databases software architectures performance analysis and assessments--a systematic
literature review. arXiv preprint arXiv:2209.06977.
Khan, W., Raj, K., Kumar, T., Roy, A. M., and Luo, B. (2022b). Introducing urdu digits
dataset with demonstration of an efficient and robust noisy decoder-based pseudo example
generator. Symmetry, 14(10):1976.
Kim, J. S., Elli, G. V., and Bedny, M. (2019). Knowledge of animal appearance among sighted
and blind adults. Proceedings of the National Academy of Sciences, 116(23):11213--11222.
Kudo, H., Koshino, Y., Eto, A., Ichimura, M., and Kaeriyama, M. (2012). Cost-effective
accurate estimates of adult chum salmon, oncorhynchus keta, abundance in a japanese river
using a radio-controlled helicopter. Fisheries Research, 119:94--98.
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. nature, 521(7553):436--444.
Lee, W. Y., Park, M., and Hyun, C.-U. (2019). Detection of two arctic birds in greenland and
an endangered bird in korea using rgb and thermal cameras with an unmanned aerial vehicle
(uav). PLOS ONE, 14(9):1--16.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017a). Focal loss for dense object
detection. In Proceedings of the IEEE international conference on computer vision, pages
2980--2988.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017b). Focal loss for dense object
detection. In Proceedings of the IEEE international conference on computer vision, pages
2980--2988.
Liu, S., Qi, L., Qin, H., Shi, J., and Jia, J. (2018). Path aggregation network for instance
segmentation. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 8759--8768.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C., and Berg, A. (2016). Ssd:
Single shot multibox detector. In European conference on computer vision (ECCV).
Loshchilov, I. and Hutter, F. (2017). Sgdr: Stochastic gradient descent with warm restarts.
Mannocci, L., Baidai, Y., Forget, F., Tolotti, M. T., Dagorn, L., and Capello, M. (2021).
Machine learning to detect bycatch risk: Novel application to echosounder buoys data in
tuna purse seine fisheries. Biological Conservation, 255:109004.
Meena, S. D. and Loganathan, A. (2020). Intelligent animal detection system using sparse multi
discriminative-neural network (smd-nn) to mitigate animal-vehicle collision. Environmental
Science and Pollution Research, 27:39619--39634.
Misra, D. (2020). Mish: A self regularized non-monotonic activation function.
Moreni, M., Theau, J., and Foucher, S. (2021). Train fast while reducing false positives:
Improving animal classification performance using convolutional neural networks. Geomatics,
1(1):34--49.
Naude, J. and Joubert, D. (2019). The aerial elephant dataset: A new public benchmark for
aerial object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition Workshops, pages 48--55.
Norouzzadeh, M. S., Nguyen, A., Kosmala, M., Swanson, A., Palmer, M. S., Packer, C.,
and Clune, J. (2018). Automatically identifying, counting, and describing wild animals in
camera-trap images with deep learning. Proceedings of the National Academy of Sciences,
115(25):E5716--E5725.
O’Brien, T. (2010). Wildlife picture index and biodiversity monitoring: issues and future
directions. Animal Conservation, 13(4):350--352.
Ofli, F., Meier, P., Imran, M., Castillo, C., Tuia, D., Rey, N., Briant, J., Millet, P., Reinhard,
F., Parkan, M., et al. (2016). Combining human computing and machine learning to make
sense of big (aerial) data for disaster response. Big data, 4(1):47--59.
Parham, J., Stewart, C., Crall, J., Rubenstein, D., Holmberg, J., and Berger-Wolf, T. (2018). An
animal detection pipeline for identification. In 2018 IEEE Winter Conference on Applications
of Computer Vision (WACV), pages 1075--1083. IEEE.
Peng, J., Wang, D., Liao, X., Shao, Q., Sun, Z., Yue, H., and Ye, H. (2020). Wild animal
survey using uas imagery and deep learning: modified faster r-cnn for kiang detection in
tibetan plateau. ISPRS Journal of Photogrammetry and Remote Sensing, 169:364--376.
Petso, T., Jamisola, R. S., Mpoeleng, D., and Mmereki, W. (2021). Individual animal and herd
identification using custom yolo v3 and v4 with images taken from a uav camera at different
altitudes. In 2021 IEEE 6th International Conference on Signal and Image Processing
(ICSIP), pages 33--39. IEEE.
Pringle, R. M., Syfert, M., Webb, J. K., and Shine, R. (2009). Quantifying historical changes
in habitat availability for endangered species: use of pixel-and object-based remote sensing.
Journal of Applied Ecology, 46(3):544--553.
Rawat, W. and Wang, Z. (2017). Deep convolutional neural networks for image classification:
A comprehensive review. Neural computation, 29(9):2352--2449.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You only look once: Unified,
real-time object detection. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 779--788.
Redmon, J. and Farhadi, A. (2017). Yolo9000: better, faster, stronger. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 7263--7271.
Redmon, J. and Farhadi, A. (2018). Yolov3: An incremental improvement.
Ren, S., He, K., Girshick, R., and Sun, J. (2016). Faster r-cnn: towards real-time object
detection with region proposal networks. IEEE transactions on pattern analysis and machine
intelligence, 39(6):1137--1149.
Rey, N., Volpi, M., Joost, S., and Tuia, D. (2017). Detecting animals in african savanna with
uavs and the crowds. Remote Sensing of Environment, 200:341--351.
Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., and Savarese, S. (2019). Generalized
intersection over union: A metric and a loss for bounding box regression. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 658--666.
Roy, A. M. (2021). Finite element framework for efficient design of three dimensional
multicomponent composite helicopter rotor blade system. Eng, 2(1):69--79.
Roy, A. M. (2022a). Adaptive transfer learning-based multiscale feature fused deep convolutional
neural network for eeg mi multiclassification in brain--computer interface. Engineering
Applications of Artificial Intelligence, 116:105347.
Roy, A. M. (2022b). An efficient multi-scale CNN model with intrinsic feature integration for
motor imagery EEG subject classification in brain-machine interfaces. Biomedical Signal
Processing and Control, 74:103496.
Roy, A. M. (2022c). A multi-scale fusion CNN model based on adaptive transfer learning for
multi-class MI-classification in BCI system. bioRxiv.
Roy, A. M. and Bhaduri, J. (2021). A deep learning enabled multi-class plant disease detection
model based on computer vision. AI, 2(3):413--428.
Roy, A. M. and Bhaduri, J. (2022). Real-time growth stage detection model for high degree
of occultation using DenseNet-fused YOLOv4. Computers and Electronics in Agriculture,
193:106694.
Roy, A. M., Bose, R., and Bhaduri, J. (2022). A fast accurate fine-grain object detection model
based on YOLOv4 deep neural network. Neural Computing and Applications, pages 1--27.
Ruff, Z. J., Lesmeister, D. B., Appel, C. L., and Sullivan, C. M. (2021). Workflow and
convolutional neural network for automated identification of animal sounds. Ecological
Indicators, 124:107419.
Saxena, A., Gupta, D. K., and Singh, S. (2021). An animal detection and collision avoidance
system using deep learning. In Advances in Communication and Computational Technology,
pages 1069--1084. Springer.
Schindler, F. and Steinhage, V. (2021). Identification of animals and recognition of their actions
in wildlife videos using deep learning techniques. Ecological Informatics, 61:101215.
Singh, A., Pietrasik, M., Natha, G., Ghouaiel, N., Brizel, K., and Ray, N. (2020). Animal
detection in man-made environments. In 2020 IEEE Winter Conference on Applications of
Computer Vision (WACV), pages 1427--1438.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014).
Dropout: a simple way to prevent neural networks from overfitting. The journal of machine
learning research, 15(1):1929--1958.
Stern, E. R. and Humphries, M. M. (2022). Interweaving local, expert, and indigenous knowledge
into quantitative wildlife analyses: A systematic review. Biological Conservation, 266:109444.
Taheri, S. and Toygar, Ö. (2018). Animal classification using facial images with score-level
fusion. IET Computer Vision, 12:679--685(6).
Torney, C. J., Lloyd-Jones, D. J., Chevallier, M., Moyer, D. C., Maliti, H. T., Mwita, M.,
Kohi, E. M., and Hopcraft, G. C. (2019). A comparison of deep learning and citizen science
techniques for counting wildlife in aerial survey images. Methods in Ecology and Evolution,
10(6):779--787.
Tzutalin (2015). LabelImg.
Voulodimos, A., Doulamis, N., Doulamis, A., and Protopapadakis, E. (2018). Deep learning for
computer vision: A brief review. Computational intelligence and neuroscience, 2018.
Wang, D., Shao, Q., and Yue, H. (2019). Surveying wild animals from satellites, manned
aircraft and unmanned aerial systems (UASs): A review. Remote Sensing, 11(11):1308.
Xiao, Z., Xu, X., Xing, H., Luo, S., Dai, P., and Zhan, D. (2021a). RTFN: a robust temporal
feature network for time series classification. Information Sciences, 571:65--86.
Xiao, Z., Xu, X., Xing, H., Song, F., Wang, X., and Zhao, B. (2021b). A federated learning
system with enhanced feature extraction for human activity recognition. Knowledge-Based
Systems, 229:107338.
Xiao, Z., Xu, X., Zhang, H., and Szczerbicki, E. (2021c). A new multi-process collaborative
architecture for time series classification. Knowledge-Based Systems, 220:106934.
Xing, H., Xiao, Z., Qu, R., Zhu, Z., and Zhao, B. (2022a). An efficient federated distillation
learning system for multitask time series classification. IEEE Transactions on Instrumentation
and Measurement, 71:1--12.
Xing, H., Xiao, Z., Zhan, D., Luo, S., Dai, P., and Li, K. (2022b). SelfMatch: Robust
semisupervised time-series classification with self-distillation. International Journal of
Intelligent Systems.
Yao, Z., Cao, Y., Zheng, S., Huang, G., and Lin, S. (2021). Cross-iteration batch normalization.
Zhao, Z.-Q., Zheng, P., Xu, S.-t., and Wu, X. (2019a). Object detection with deep learning: A
review. IEEE transactions on neural networks and learning systems, 30(11):3212--3232.
Zhao, Z.-Q., Zheng, P., Xu, S.-t., and Wu, X. (2019b). Object detection with deep learning: A
review. IEEE transactions on neural networks and learning systems, 30(11):3212--3232.
Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., and Ren, D. (2020). Distance-IoU loss: Faster
and better learning for bounding box regression. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 34, pages 12993--13000.
Zhu, X. X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., and Fraundorfer, F. (2017). Deep
learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience
and Remote Sensing Magazine, 5(4):8--36.
Zotin, A. G. and Proskurin, A. V. (2019). Animal detection using a series of images under
complex shooting conditions. The International Archives of the Photogrammetry, Remote
Sensing and Spatial Information Sciences, XLII-2/W12:249--257.