A Computer Vision-Based Object Localization Model for Endangered Wildlife Detection

January 2022
SSRN Electronic Journal

DOI:10.2139/ssrn.4315295

Authors:

Arunabha M Roy

Texas A&M University

A computer vision-based object localization model for endangered wildlife detection

Arunabha M. Roy∗1, Jayabrata Bhaduri2, Teerath Kumar3, and Kislay Raj3

1 Aerospace Engineering Department, University of Michigan, Ann Arbor, MI 48109, USA

2 Capacloud AI, Deep Learning & Data Science Division, Kolkata, WB 711103, India

3 School of Computing, Dublin City University, Dublin 9, Ireland

Abstract

Objective. Climatic instability, various ecological disturbances, and human actions threaten the existence of various endangered wildlife species. Therefore, an up-to-date, accurate, and detailed detection process plays an important role in preventing biodiversity loss and in supporting conservation and ecosystem management. Current state-of-the-art wildlife detection models, however, often lack superior feature extraction capability in complex environments, limiting the development of accurate and reliable detection models. Method. To this end, we present WilDect-YOLO, a deep learning (DL)-based automated high-performance detection model for real-time endangered wildlife detection. In the model, we introduce a residual block in the CSPDarknet53 backbone for strong and discriminating deep spatial feature extraction and integrate DenseNet blocks to better preserve critical feature information. To enhance receptive field representation, preserve fine-grain localized information, and improve feature fusion, a Spatial Pyramid Pooling (SPP) block and a modified Path Aggregation Network (PANet) have been implemented, which results in superior detection under various challenging environments. Results. Evaluated on a custom endangered wildlife dataset with high variability and complex backgrounds, WilDect-YOLO obtains a mean average precision (mAP) of 96.89%, an F1-score of 97.87%, and a precision of 97.18% at a detection rate of 59.20 FPS, outperforming current state-of-the-art models. Significance. The present research provides an effective and efficient detection framework that addresses the shortcomings of existing DL-based wildlife detection models by providing highly accurate species-level localized bounding box prediction. The current work constitutes a step towards a non-invasive, fully automated animal observation system for real-time in-field applications.

∗Corresponding author, 4/09/2022

Keywords: Endangered wildlife detection; You Only Look Once (YOLOv4) algorithm; Object

Detection (OD); Computer vision; Deep Learning (DL); Wildlife Preservation

1. Introduction :

In recent years, automated wildlife detection has played a critical role in wildlife survey (Peng

et al.,2020;Chalmers et al.,2021;Delplanque et al.,2021), conservation (Khaemba and Stein,

2002;O’Brien,2010), and ecosystem management (Austrheim et al.,2014;Harris et al.,2010)

to tackle the accelerating worldwide biodiversity crisis. Up-to-date, detailed, and accurate wildlife

data can be beneficial in preventing biodiversity losses, ecosystem damage, and poaching

(Norouzzadeh et al., 2018; Petso et al., 2021). Traditional wildlife survey techniques mainly include distance sampling (Aebischer et al., 2017), camera trapping (Chauvenet et al., 2017), and satellite monitoring (Chauvenet et al., 2017). However, such techniques suffer from low efficiency, high cost, the need for qualified personnel, and individual observer bias (Guo et al., 2018). Similarly, wild animal surveys with aerial image object

detection generally suffer from low accuracy due to complex backgrounds and disturbances

among wild animals (Eikelboom et al.,2019). Moreover, satellite-based monitoring methods

require very-high-resolution satellite imagery and are limited to relatively larger-sized animals

(Wang et al.,2019).

To circumvent such issues, various automatic and semi-automatic detection algorithms for

wildlife animals have been adopted, in particular, from unmanned aircraft systems (UASs)

imagery (Gonzalez et al.,2016;Ofli et al.,2016). Additionally, pixel-based classification methods

that include threshold setting, supervised, and unsupervised classification have been popular

methods for detecting animals in remote sensing images (Pringle et al.,2009;Kudo et al.,2012).

However, these methods are not adequate for detecting targets whose gray-scale values are similar to those of a complex background (Wang et al., 2019). To detect targets in complex environments,

various machine learning (ML) methods have been employed to localize objects combining

rotation-invariant object descriptors for automated wildlife detection (Cheng and Han,2016).

Although traditional ML yields encouraging results in relatively simple scenarios, it is neither adequate nor robust for detecting complicated animal features such as structure, texture, and morphology (Rey et al., 2017; Peng et al., 2020).

More recently, driven by big-data methods (Khan et al.,2022a), deep learning (DL)

characterized by multilayer neural networks (NN) (LeCun et al.,2015) has shown remarkable

breakthroughs in pattern recognition for various fields including image classification (Rawat

and Wang,2017;Jamil et al.,2022;Khan et al.,2022b), computer vision (Voulodimos et al.,

2018;Chandio et al.,2022), object detection (Zhao et al.,2019a;Roy and Bhaduri,2021;Roy

et al.,2022;Roy and Bhaduri,2022), time-series classification (Xiao et al.,2021a,c;Xing et al.,

2022a,b), brain-computer interface (Roy,2022b,a,c), and across diverse scientific disciplines

(Zhu et al.,2017;Roy,2021;Bose and Roy,2022). Particularly in object localization, DL

methods have demonstrated superior accuracy (Han et al.,2018) that can be categorized into

two classes: two-stage and one-stage detector (Lin et al.,2017a). Two-stage detectors including

Region Convolution Neural Network (RCNN) (Girshick,2015), faster-RCNN (Ren et al.,2016),

mask-RCNN (He et al.,2017) etc have shown a significant improvement in accuracy in object

localization. In recent times, You Only Look Once (YOLO) variants (Redmon et al.,2016;

Redmon and Farhadi,2017,2018;Bochkovskiy et al.,2020) have been proposed that unify

target classification and localization leading to significant improvement in the detection speed

(Roy et al.,2022;Roy and Bhaduri,2022,2021). Therefore, driven by advances in computer

vision technologies, wildlife detection is rapidly transforming into a data-rich discipline and has

been applied in the automated detection of a variety of wildlife species (Eikelboom et al.,2019;

Gonçalves et al., 2020; Duporge et al., 2021). Along similar lines, various DL methodologies

such as convolutional neural network (CNN) (Kellenberger et al.,2018), RetinaNet (Eikelboom

et al.,2019), ResNet-50 (Chabot et al.,2022), YOLOv3 (Torney et al.,2019), Faster R-CNN

(Peng et al.,2020), Libra-RCNN (Delplanque et al.,2021) etc have demonstrated high precision

in object localization and can be deployed as a reliable and predictable model for automated

wildlife detection.

Motivations : The main motivation of the present study is to design an efficient and robust

computer vision-based algorithm for the accurate classification and localization of endangered

wildlife species. Climatic instability and various human activities such as thawing, hunting,

oil drilling, etc threaten the existence of various endangered animals and create damage to

ecosystems (Jaskólski, 2021). Species that inhabit such ecosystems are highly specialized to

live in adverse weather conditions, which is why such changes affect them severely (Crooks

et al.,2017). Thus, it is crucial to build an accurate automated endangered wildlife detection

model to conserve and protect the species and the ecosystem. Although several state-of-the-art works exist for wildlife detection (Barbedo et al., 2019; Naude and Joubert, 2019; Peng et al., 2020; Moreni et al., 2021), including multi-species animal detection (Eikelboom et al., 2019; Delplanque et al., 2021), they often suffer from low accuracy, missed detections, and relatively large computational overhead. Additionally, to the best of the authors' knowledge, there is no systematic study that addresses the challenge of detecting and accurately localizing multiple endangered wildlife species, which is worthy of further investigation. To this end, the current work aims to develop an efficient and robust endangered wildlife classification and accurate object localization model that is also economical in training time and computational cost, a combination currently missing in recent state-of-the-art models for endangered wildlife detection.

Challenges : Despite illustrating outstanding performance in detecting wildlife species, current

state-of-the-art DL algorithms are still not suitable due to their insufficient fine-grain feature

extraction capability leading to missed detection and false object predictions for endangered

species which possess unique body textures, shapes, sizes, and colors (Kim et al., 2019). Across various species, accurate detection and localization can be challenging due to significant variability of lighting conditions, low visibility, a high degree of occlusion and overlap, the coexistence of multi-object classes with various aspect ratios, and other morphological characteristics (Chabot et al., 2019). Additionally, visual similarities, complex backgrounds, the poorly distinguishable interface between species and their surroundings, and various

other critical factors offer additional challenges and difficulties for the state-of-the-art wildlife

detection models (Feng and Li,2022).

To address the aforementioned shortcomings, in the current study, we present WilDect-YOLO,

based on an improved version of the state-of-the-art YOLOv4 detection model for accurate real-time endangered wildlife detection. In WilDect-YOLO, we integrate DenseNet blocks to better preserve and reuse critical feature information. In addition, two residual blocks have been

carefully designed in the CSPDarknet53 backbone for strong and discriminating deep spatial

features extraction. Furthermore, Spatial Pyramid Pooling (SPP) has been tightly attached

to the backbone to enhance the representation of receptive fields. We have also utilized

a modified Path Aggregation Network (PANet) to efficiently preserve fine-grain localized

information by feature fusion. Additionally, we performed an extensive ablation study for

backbone-neck architecture to optimize both detection accuracy and detection speed. The proposed WilDect-YOLO has been employed to detect eight different endangered wildlife species and provides superior and accurate detection under various complex and challenging environments. WilDect-YOLO effectively addresses the shortcomings of existing DL-based wildlife detection models and shows strong potential for real-time in-field applications.

In short, current work constitutes a step toward a non-invasive, fully automated efficient animal

observation system.

2. Related Works :

In the present section, some recent and relevant works have been highlighted. More recently, a

two-channeled perceiving residual pyramid network (Ruff et al.,2021) has been proposed based

on audio signals that delivers superior detection accuracy. Furthermore, different techniques such as a segmentation-based YOLO model (Parham et al., 2018), a fast-depth CNN-based detection model for highly cluttered camera images (Singh et al., 2020), a sparse multi discriminative-neural network (SMD-NN) (Meena and Loganathan, 2020), a fast image-enhancement algorithm based on Multi-Scale Retinex (MSR) (Zotin and Proskurin, 2019), a CNN-based model for facial detection (Taheri and Onsen Toygar, 2018), a semi-supervised learning-based Multi-part CNN (MP-CNN) (Divya Meena and Agilandeeswari, 2019), and a CNN with k-Nearest Neighbor (kNN) have been utilized for wildlife detection and provide state-of-the-art performance.

In terms of endangered animal detection, only a handful of works have addressed this important issue. Notably, a DL-based model for classifying red pandas (He et al., 2019) and animal action recognition based on wildlife videos (Schindler and Steinhage, 2021) are representative recent works. Additionally, RGB and thermal image-based Arctic bird detection using drones has been developed by Lee et al. (2019). Having reviewed the aforementioned methods geared towards endangered wildlife detection, the current work aims to develop an efficient and robust endangered wildlife classification and accurate object localization model that is also economical in training time and computational cost, which is currently lacking in recent state-of-the-art endeavors.

3. Endangered wildlife species dataset :

Since there is no publicly available endangered wildlife dataset, in the present work, we

have extensively collected high-resolution web-harvested images for different endangered species

under various complex backgrounds. The dataset used for the experimentation comprises

eight classes: Polar Bear (Ursus maritimus), Galápagos Penguin (Spheniscus mendiculus), Giant Panda (Ailuropoda melanoleuca), Red Panda (Ailurus fulgens), African forest elephant (Loxodonta cyclotis), Sunda Tiger (Panthera tigris sondaica), Black Rhino (Diceros bicornis), and African wild Dog (Lycaon pictus). Fig. 1 shows some representative images from the custom dataset for the eight classes considered herein. Notably, the Galápagos Penguin, Red Panda, African forest elephant, Sunda Tiger, Black Rhino, and African wild Dog have been declared critically endangered species. The dataset contains a total of 1,600 images, with 200 images for each class.

Figure 1: Representative sample images from the endangered wildlife dataset, which consists of eight classes: (a) Polar Bear; (b) Galápagos Penguin; (c) Giant Panda; (d) Red Panda; (e) African forest elephant; (f) Sunda Tiger; (g) Black Rhino; and (h) African wild Dog.

To capture variability and challenging conditions in the dataset, we have included images that feature limited and/or full illumination, low visibility, a high degree of occlusion, multiple objects

with overlap, complex backgrounds, the textural similarity of the object and the background,

and noisy environment. Additionally, the images of the dataset have variations in their scale,

orientation, and resolution.

4. Proposed Methodology for object localization:

In object detection, the target object classification and localization are performed simultaneously

where the target class has been categorized and separated from the background by drawing

bounding boxes (BBs) on input images containing the entire object. This can be particularly

useful for counting endangered species for accurate surveying. To this end, the main goal of the current work is to develop an accurate and robust endangered wildlife localization model.

Figure 2: Schematic of (a) YOLO object localization process for endangered wildlife detection; (b) offset regression process for target BBs prediction during CIoU loss. (Diagram labels: input N × N grids; BBs + confidence score; class probability; wildlife detection.)

In this regard, different variants of YOLO (Redmon et al., 2016; Redmon and Farhadi, 2017, 2018; Bochkovskiy et al., 2020) are some of the best high-precision one-stage object detection models that consist of the following parts: a backbone for semantic deep feature extraction, followed by the neck for hierarchical feature fusion, and finally a detection head for object classification and localization. The overall schematic of the YOLO object localization process is depicted in Fig. 2, where the YOLO algorithm transforms the object detection task into a regression problem by generating BB coordinates and probabilities for each class. During the process, the input image is uniformly divided into $N \times N$ grids in which predictive BBs are generated. Subsequently, a confidence score is assigned if the target object falls inside that particular grid. The model detects the target object for a particular class when the center of the ground truth lies inside a specified grid. During detection, each grid predicts a number of BBs with the confidence value $\Theta_B$ as:

$$\Theta_B = \Pr(obj) \times IoU^{t}_{p}, \qquad \Pr(obj) \in \{0, 1\} \tag{1}$$

where $\Pr(obj)$ infers the accuracy of BB prediction, i.e., $\Pr(obj) = 1$ indicates that the target class falls inside the grid, otherwise, $\Pr(obj) = 0$. The degree of overlap between ground truth and the predicted BB has been described by the scale-invariant evaluation metric intersection over union (IoU) which can be expressed as

$$IoU = \frac{B_p \cap B_t}{B_p \cup B_t} \tag{2}$$

where $B_t$ and $B_p$ are the ground truth and predicted BBs, respectively. However, to further improve BB regression and alleviate gradient disappearance, generalized IoU (GIoU) (Rezatofighi et al., 2019) and distance-IoU (DIoU) (Zheng et al., 2020) have been introduced considering aspect ratios and orientation of the overlapping BBs. More recently, complete IoU (CIoU) (Zheng et al., 2020) has been proposed for improved accuracy and faster convergence speed in BB prediction which can be expressed as

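$$L_{CIoU} = 1 + \beta \xi + \frac{\rho^2(b_p, b_t)}{\eta^2} - IoU \tag{3}$$

$$\xi = \frac{4}{\pi^2} \left( \tan^{-1}\frac{w_t}{h_t} - \tan^{-1}\frac{w_p}{h_p} \right)^2; \qquad \beta = \frac{\xi}{(1 - IoU) + \xi} \tag{4}$$

where $b_t$ and $b_p$ denote the centroids of $B_t$ and $B_p$, respectively, and $\rho(\cdot)$ is the distance between them; $\xi$ and $\beta$ are the consistency and trade-off parameters, respectively. As shown in Fig. 2-(b), $\eta$ is the smallest diagonal length of $B_p \cup B_t$; $w_t$, $w_p$ are widths and $h_t$, $h_p$ are heights of $B_t$ and $B_p$, respectively (the ground-truth width and height are labelled $w_{gt}$ and $h_{gt}$ in Fig. 2). With increasing $w_p/h_p$, we get $\xi \to 0$ from Eq. 4. Therefore, to optimize the influence of the CIoU, $w_p/h_p$ can be properly chosen for the YOLO model. Finally, the best BB prediction can be obtained from the non-maximum suppression (NMS) (Ren et al., 2016) algorithm from multiple scales.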
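The confidence value, IoU, and CIoU loss of Eqs. 1 to 4 can be illustrated with a minimal Python sketch. It assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max) with positive width and height, takes the centroid-distance term of Eq. 3 to be the squared Euclidean distance, and takes the normalizing diagonal to be that of the smallest box enclosing both BBs (the usual CIoU convention). The function names are illustrative and are not taken from the authors' implementation.

```python
import math


def iou(box_p, box_t):
    """Intersection over union (Eq. 2) of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_p[2], box_t[2]) - max(box_p[0], box_t[0]))
    inter_h = max(0.0, min(box_p[3], box_t[3]) - max(box_p[1], box_t[1]))
    inter = inter_w * inter_h
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_t = (box_t[2] - box_t[0]) * (box_t[3] - box_t[1])
    union = area_p + area_t - inter
    return inter / union if union > 0 else 0.0


def confidence(pr_obj, box_p, box_t):
    """Confidence value of Eq. 1: Pr(obj) times the IoU of prediction and truth."""
    return pr_obj * iou(box_p, box_t)


def ciou_loss(box_p, box_t):
    """CIoU loss of Eqs. 3-4 for a predicted box and a ground-truth box."""
    overlap = iou(box_p, box_t)
    # Squared distance between the two centroids (assumed Euclidean).
    cx_p, cy_p = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    cx_t, cy_t = (box_t[0] + box_t[2]) / 2, (box_t[1] + box_t[3]) / 2
    dist2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2
    # Squared diagonal of the smallest box enclosing both BBs.
    enc_w = max(box_p[2], box_t[2]) - min(box_p[0], box_t[0])
    enc_h = max(box_p[3], box_t[3]) - min(box_p[1], box_t[1])
    diag2 = enc_w ** 2 + enc_h ** 2
    # Aspect-ratio consistency (xi) and trade-off (beta) terms of Eq. 4.
    w_p, h_p = box_p[2] - box_p[0], box_p[3] - box_p[1]
    w_t, h_t = box_t[2] - box_t[0], box_t[3] - box_t[1]
    xi = (4 / math.pi ** 2) * (math.atan(w_t / h_t) - math.atan(w_p / h_p)) ** 2
    denom = (1 - overlap) + xi
    beta = xi / denom if denom > 0 else 0.0
    return 1 + beta * xi + dist2 / diag2 - overlap
```

For two identical boxes every penalty term vanishes and the loss is zero, whereas disjoint boxes give a loss above one because the centroid-distance term remains positive even though the IoU is zero.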

4.1 WilDect-YOLO architecture:

In recent endeavors, various attempts have been made on computer vision-based object detection

algorithm for accurate wildlife detection and survey utilizing deep CNN (Kellenberger et al.,

2019), R-CNN (Ibraheam et al.,2021), Faster R-CNN (Peng et al.,2020), single shot multi-box

detector (SSD) (Saxena et al., 2021), and YOLO (Choe and Kim, 2020). Although the aforementioned techniques have demonstrated outstanding performance, the endangered wildlife detection task, specifically in Polar and African regions, faces several specific challenges, in particular, significant variability of lighting conditions, low visibility, a high degree of occlusion and overlap, the coexistence of multiple target classes with various aspect ratios, visual similarities, complex backgrounds, and a poorly distinguishable interface between species and their surroundings. Such challenging conditions lead to false object predictions and a large number of missed detections from the original YOLOv4 (Bochkovskiy et al., 2020) due to its insufficient fine-grain feature extraction capabilities.

To resolve the existing issues, in the current work, we propose a novel object localization

algorithm WilDect-YOLO based on a state-of-the-art YOLOv4 network, specially designed

for endangered wildlife detection, to enhance feature extraction, preserve fine-grain localized

information and improve feature fusion that provides superior detection under various challenging

environments. The model has been optimized to achieve better efficiency and accuracy of BB

prediction based on the characteristics and complexities of the endangered wildlife dataset

considered herein. The overall network of the object localization model is shown in Fig. 3. To improve performance in terms of classification accuracy and object localization, we perform extensive experiments, and various modifications are proposed which are detailed in the subsequent sections.

Figure 3: Schematic of the proposed WilDect-YOLO, consisting of an improved Dense-CSPDarknet53 with residual block CSPX1-n and SPP in the backbone, a modified PANet in the neck part, and a regular YOLO head. (Diagram labels: input 416 × 416 × 3; Dense-CSPDarknet53 backbone with dense blocks Dense B-1 and Dense B-2 and an SPP block with max-pooling kernels 5, 9, and 13; modified PANet neck; detection outputs at 13 × 13 × 24, 26 × 26 × 24, and 52 × 52 × 24; class, CIoU, and confidence losses.)

4.2 Improvement of discriminative feature extraction:

In the present study, we have introduced a residual block CSPX1-n, where n represents residual weighting operations, to improve detection speed and performance. We integrate CSPX1-n modules in the CSPDarknet53 backbone, replacing the original CSP8 and CSP4 residual blocks, to extract fine-grained rich semantic information as shown in Fig. 3. In the CSPX1-n block, we divide the input features into two parts. In the first part, a (3 × 3) convolution is performed, followed by an additional (3 × 3) convolution to maintain the number of feature maps after entering the next residual unit as shown in Fig. 4-(a). To further improve the feature extraction, we perform a 3 × 3 convolution at the end. The second part, in contrast, acts as a residual edge for the convolution. These two parts are concatenated at the end to improve the semantic feature information. Implementation of the CSPX1-n modules in the improved CSPDarknet53 helps to learn more expressive features, which significantly improves detection accuracy for the custom wildlife dataset used herein.

4.3 Preserving critical feature information:

To preserve critical feature maps and efficiently reuse the discriminative feature information,

we have fused DenseNet (Huang et al., 2017) in the original CSPDarknet53. In DenseNet, each layer is connected to the other layers in a feed-forward mode, where the $n$-th layer receives the feature information $X_n$ from all the previous layers $X_0, X_1, \ldots, X_{n-1}$ as $X_n = H_n([X_0, X_1, \ldots, X_{n-1}])$, where $H_n$ is the feature map function for the $n$-th layer. The schematics of the DenseNet block structures are shown in Fig. 4-(b, c). As shown in Fig. 3, we have introduced two DenseNet blocks; the first block (Dense B-1) has been attached before the cross-stage partial block CSPX1-4, whereas the second block (Dense B-2) has been placed before CSPX1-2 in the proposed WilDect-YOLO network, which results in enhanced feature

propagation. It has been found that DenseNet significantly improves the feature transfer and mitigates over-fitting in the proposed detection network. Additionally, by reducing redundant feature operations, such an implementation improves the computational speed.

Figure 4: Schematic of (a) CSPX1-n residual block; (b) dense block (DB)-1; (c) dense block (DB)-2; (d) CSPX2-n residual block architecture used in the WilDect-YOLO detection model. (Diagram labels: Dense Block-1 takes a 26 × 26 × 256 input through 26 × 26 × 320, 26 × 26 × 384, and 26 × 26 × 448 to a 26 × 26 × 512 output; Dense Block-2 takes a 13 × 13 × 512 input through 13 × 13 × 640, 13 × 13 × 768, and 13 × 13 × 896 to a 13 × 13 × 1024 output; each block includes a transition layer.)
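The channel bookkeeping of the two dense blocks can be sketched in a few lines of Python. The growth rates used below (64 and 128 feature maps per layer) are an inference from the feature-map sizes labelled in Fig. 4, not values stated in the text, and `dense_forward` is a hypothetical helper that mimics the concatenation rule with plain lists instead of tensors.

```python
def dense_block_channels(in_channels, growth, num_layers):
    """Channel count after each layer when every layer's output is
    concatenated with all earlier feature maps."""
    channels = [in_channels]
    for _ in range(num_layers):
        channels.append(channels[-1] + growth)
    return channels


def dense_forward(x0, layer_fns):
    """Feed each layer the concatenation of all earlier feature maps.

    Feature maps are plain lists here; each function in layer_fns maps the
    concatenated list to a new list (a stand-in for a convolutional layer).
    """
    features = [x0]
    for fn in layer_fns:
        concatenated = [v for fmap in features for v in fmap]
        features.append(fn(concatenated))
    return features


# Dense Block-1 (26 x 26 maps) and Dense Block-2 (13 x 13 maps) as labelled in Fig. 4.
block1 = dense_block_channels(256, growth=64, num_layers=4)
block2 = dense_block_channels(512, growth=128, num_layers=4)
```

Because each layer sees every earlier feature map directly, early features are reused rather than recomputed, which is the property the section relies on for feature preservation.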

4.4 Receptive field enhancement:

One of the requirements of CNN is to have fixed-size input images. However, due to the

different aspect ratios of the images, they have been fixed by cropping and warping during the

convolution process which results in losing important features. In this regard, SPP (He et al.,

2015) applies an efficient strategy in detecting target objects at multiple length scales. To

this end, we have added an SPP block integrated with CSPX1-2 of the Dense-CSPDarknet53 backbone to improve receptive field representation and extraction of important contextual features as shown in Fig. 3. In the proposed model, a modified SPP consisting of sliding kernels of various sizes (i.e., 5 × 5, 9 × 9, and 13 × 13) with maximum pooling has been prescribed that effectively increases the receptive field representation of the backbone.
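A minimal pure-Python sketch of the pooling step follows. It assumes stride-1 max pooling with "same" padding and a channel-wise concatenation of the input with the three pooled maps, as in the standard YOLOv4 SPP block; the text does not spell out these details, and the single-channel list-of-lists feature map is only for illustration.

```python
def max_pool_same(fmap, k):
    """k x k max pooling with stride 1; border windows are clipped to the map,
    so the output has the same height and width as the input."""
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [fmap[a][b]
                      for a in range(max(0, i - r), min(h, i + r + 1))
                      for b in range(max(0, j - r), min(w, j + r + 1))]
            row.append(max(window))
        out.append(row)
    return out


def spp(fmap, kernels=(5, 9, 13)):
    """Return the input map followed by one pooled map per kernel size
    (the list plays the role of a channel-wise concatenation)."""
    return [fmap] + [max_pool_same(fmap, k) for k in kernels]


# A 13 x 13 map with a single activation at the centre.
fmap = [[0.0] * 13 for _ in range(13)]
fmap[6][6] = 1.0
pooled = spp(fmap)
```

On this toy input the 5 × 5 kernel spreads the central activation over a 5 × 5 neighbourhood, while the 13 × 13 kernel spreads it over the whole map, which illustrates how the larger kernels widen the receptive field without changing the spatial resolution.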

4.5 Preserving fine-grain localized information:

In addition, an improved PANet (Liu et al., 2018) integrated with CSPX2-n has been utilized as a neck of the detection model as shown in Fig. 3. It can efficiently combine high and low

feature fusion for multi-scale feature pyramid maps preserving fine-grain localized information.

Additionally, by employing flexible ROI pooling and element-wise max operation, PANet can

efficiently fuse the information from previous feature layers resulting in significant improvement

in the detection accuracy of the model.

Furthermore, CIoU loss function (Zheng et al.,2020), dropblock regularization (Ghiasi et al.,

2018), Cross mini Batch Normalization (Yao et al.,2021), dropout in feature map (Srivastava

et al.,2014), and cosine annealing scheduler (Loshchilov and Hutter,2017) have been employed

to further improve the performance of WilDect-YOLO. We use the original YOLOv3 head in the final part of the detection network. Utilizing an input image size of 416 × 416 × 3, the detection head of WilDect-YOLO can predict BBs at three different scales: (13 × 13 × 24), (26 × 26 × 24), and (52 × 52 × 24), as shown in Fig. 3. After extensive experiments, we have found that Mish (Misra, 2020) activation provides the optimal performance in terms of model accuracy. Overall, our proposed methodology provides the best results in terms of accuracy and performance compared to current state-of-the-art models for endangered wildlife detection (see Section 6.2).
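Two of the ingredients named above, the Mish activation and the cosine annealing scheduler, have simple closed forms that can be sketched in Python. The defaults of 0.001 for the initial learning rate and 85,000 for the number of steps are the values reported in Section 5.1; the minimum learning rate of zero and the single annealing cycle without warm restarts are assumptions made only for this illustration.

```python
import math


def mish(x):
    """Mish activation: x * tanh(softplus(x)), with a numerically stable softplus."""
    softplus = math.log1p(math.exp(-abs(x))) + max(x, 0.0)
    return x * math.tanh(softplus)


def cosine_annealing_lr(step, total_steps=85_000, lr_max=0.001, lr_min=0.0):
    """Learning rate at a given step for one cosine cycle from lr_max to lr_min."""
    cos_term = 1 + math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term
```

Mish is smooth and non-monotonic near zero, passing small negative values through instead of clipping them. Under the assumptions stated here, the schedule starts at 0.001, reaches half of that value midway through training, and decays to zero at the final step.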

5. Training and performance :

5.1 Training procedure :

In the present work, we have performed an extensive and elaborate study to explore the

comparative performance analysis of the proposed WilDect-YOLO models for endangered

wildlife classification and object localization. The initial custom endangered wildlife species dataset consisting of 1,600 images has been further expanded tenfold by utilizing various data augmentation procedures (i.e., color balancing, rotation, blur processing, mirror projection, brightness transformation) to obtain a final dataset of 16,000 images (2,000 images per class). From the final dataset, 60%, 20%, and 20% of the images have been randomly

chosen for training, validation, and test sets, respectively. For the training set, LabelImg

(Tzutalin,2015) has been used for the annotation of BBs around the target classes. For all

the experiments, we have used a Windows 10 Pro (64-bit) based computational system with an Intel Core i5-10210U CPU @ 2.8 GHz × 6, 32 GB DDR4 memory, and an NVIDIA GeForce RTX 2080, utilizing CUDA 10.2.89 and cuDNN 10.2 v7.6.5 for GPU parallelization. As the required CV libraries, Visual Studio v15.9 (2017) and OpenCV 4.5.1-vc14 have been integrated with DarkNet. Unless otherwise stated, the batch size has been set to 32 and the total number of training steps has been kept at 85,000 during training. The initial learning rate has been set to 0.001. The

training dataset has been trained utilizing the available pre-trained weights-file (AlexeyAB,

2021). Various training hyperparameters for WilDect-YOLO have been detailed in Table 1.
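The 60/20/20 partition described above can be sketched as a plain random split in Python. The text only states that the images are chosen randomly, so the fixed seed, the absence of per-class stratification, and the file-name pattern are assumptions made for illustration.

```python
import random


def split_dataset(items, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle the items and cut them into training, validation, and test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(round(fractions[0] * len(items)))
    n_val = int(round(fractions[1] * len(items)))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# 16,000 augmented images; the file names are hypothetical placeholders.
image_names = [f"img_{i:05d}.jpg" for i in range(16_000)]
train_set, val_set, test_set = split_dataset(image_names)
```

With 16,000 images this yields 9,600 training, 3,200 validation, and 3,200 test images.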

Table 1: Various hyperparameter values for training the WilDect-YOLO model.

| Parameter | Value |
|---|---|
| Image size | 416 × 416 × 3 |
| Sub-division | 8 |
| Batch | 32 |
| Channels | 6 |
| Decay | 0.005 |
| Initial learning rate | 0.001 |
| Momentum | 0.9 |
| Classes | 8 |
| Training steps | 85,000 |
| Filters | 36 |

5.2 Performance metrics:

In the present work, the performance of the object detection models has been evaluated by common standard measures (Ferri et al., 2009) including average precision (AP), precision

(P), recall (R), IoU, F-1 score, mean average precision (mAP), etc. The confusion matrix

obtained from the evaluation procedure provides the following interpretations of the test results:

true positive (TP), false positive (FP), false negative (FN), and true negative (TN). During

binary classification, the classified object can be defined as TP for IoU ≥ 0.5, whereas it can be classified as FP for IoU < 0.5. Based on the aforementioned interpretations, the metric P of the classifier can be defined by its ability to distinguish target classes correctly as:

$$P = \frac{TP}{TP + FP} \tag{5}$$

The ratio of correctly predicted target classes is called R of the classifier, which can be evaluated as:

$$R = \frac{TP}{TP + FN} \tag{6}$$

Higher values of P and R indicate superior detection capability. The F1-score is the harmonic mean of P and R, given as:

$$F1\text{-}score = \frac{2 P \times R}{P + R} \tag{7}$$

A relatively high F1 score represents a robust detection model. The performance metric AP can be defined as the area under a P-R curve (Davis and Goadrich, 2006) as follows

$$AP = \int_0^1 P(R)\, dR \tag{8}$$

A higher average AP value indicates better accuracy in predicting various object classes. In addition, $AP_{50:95}$ denotes AP over IoU = 0.50 : 0.05 : 0.95; $AP_{50}$ and $AP_{75}$ are APs at IoU thresholds of 50% and 75%, respectively. The AP for detecting small, medium, and large objects can be measured through $AP_S$, $AP_M$, and $AP_L$, respectively. Finally, mAP can be obtained from the average of all APs as:

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \tag{9}$$

where $N$ is the number of APs being averaged.
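The metrics of Eqs. 5 to 9 translate directly into a short Python sketch. The trapezoidal rule used for the area under the P-R curve is an assumption made for simplicity (detection benchmarks often use an interpolated precision envelope instead), and the function names are illustrative.

```python
def precision(tp, fp):
    """Eq. 5: fraction of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Eq. 6: fraction of actual positives that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Eq. 7: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def average_precision(recalls, precisions):
    """Eq. 8: area under the P-R curve, approximated with the trapezoidal rule."""
    points = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area


def mean_average_precision(aps):
    """Eq. 9: mean of the per-class AP values."""
    return sum(aps) / len(aps)


# Consistency check with the reported values P = 97.18% and R = 98.56%.
f1_reported = f1_score(97.18, 98.56)
```

Plugging in the reported precision and recall gives an F1-score of about 97.87%, in agreement with the value quoted in the abstract.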

6. Results:

In this section, the performance and detection accuracy of the proposed WilDect-YOLO

frameworks have been discussed which have been evaluated in a custom-made endangered

wildlife dataset consisting of 8 classes. For better clarity in BBs representation, the following

BB class identifiers have been associated in the detection results: class 1- Polar Bear; class 2-

Galápagos Penguin; class 3- Giant Panda; class 4- Red Panda; class 5- African forest elephant;

class 6- Sunda Tiger; class 7- Black Rhino; and class 8- African wild Dog. The performance of

the WilDect-YOLO network has been optimized through extensive ablation studies. Finally,

the performance of the proposed model has been studied in detail and compared with several

state-of-the-art object detection models.

6.1 Optimization of network performance:

First, we conduct extensive experiments to select the backbone-neck combination that optimizes the performance of the proposed WilDect-YOLO model in terms of both detection accuracy and speed. For different backbone-neck configurations, detection accuracy in terms of AP and its variants (AP50, AP75, APS, APM, and APL) as well as detection speed (in FPS) is reported in Table 2. For the comparison, we select Mish as the activation function. From Table 2, one can see that DenseNet blocks in CSPDarknet53 (i.e., D-CSPDarknet53) improve the accuracy of the detection model compared to the original YOLOv4. The performance is further improved by introducing CSPX1-n into D-CSPDarknet53. However, such a configuration results in a slight decrease in detection speed. We observe that the best performance is achieved when both CSPX1-n and CSPX2-n are integrated into D-CSPDarknet53 and PANet, respectively. There is a significant improvement in the accuracy parameters; in particular, AP, APS, and APL increase by 4.9, 6.9, and 7.6 percentage points, respectively, compared to the CSPDarknet53+PANet configuration. Thus, such a configuration in WilDect-YOLO provides the best detection accuracy for the custom wildlife species dataset considered herein while keeping the detection speed close to that of the baseline (59.2 vs. 59.6 FPS). In summary, a proper activation function together with an improved backbone-neck combination provides an efficient high-performance model for wildlife detection in complex scenarios.

Table 2: Performance of various residual and dense block combinations in WilDect-YOLO architecture for an anchor size of 416 × 416.

Backbone + add-in        Neck + add-in    AP     AP50   AP75   APS    APM    APL    FPS
CSPDarknet53             PANet            76.8   93.6   92.5   80.9   89.2   80.9   59.6
D-CSPDarknet53           PANet            78.4   96.1   92.2   78.3   87.7   81.7   61.1
D-CSPDarknet53+CSPX1-n   PANet            79.5   96.1   92.5   77.9   88.2   82.9   60.1
CSPDarknet53             PANet+CSPX2-n    77.1   95.6   91.2   74.1   87.9   84.7   63.2
D-CSPDarknet53+CSPX1-n   PANet+CSPX2-n    81.7   96.9   92.3   87.8   92.5   88.5   59.2
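The accuracy gains quoted for the ablation study can be checked directly against Table 2. The following is a minimal sketch that transcribes the table and computes the absolute change of every metric relative to the CSPDarknet53+PANet baseline. The dictionary layout and the helper name `gains` are illustrative assumptions and are not part of the original implementation.

```python
# Values transcribed from Table 2; tuple order is (AP, AP50, AP75, APS, APM, APL, FPS).
TABLE2 = {
    ("CSPDarknet53", "PANet"): (76.8, 93.6, 92.5, 80.9, 89.2, 80.9, 59.6),
    ("D-CSPDarknet53", "PANet"): (78.4, 96.1, 92.2, 78.3, 87.7, 81.7, 61.1),
    ("D-CSPDarknet53+CSPX1-n", "PANet"): (79.5, 96.1, 92.5, 77.9, 88.2, 82.9, 60.1),
    ("CSPDarknet53", "PANet+CSPX2-n"): (77.1, 95.6, 91.2, 74.1, 87.9, 84.7, 63.2),
    ("D-CSPDarknet53+CSPX1-n", "PANet+CSPX2-n"): (81.7, 96.9, 92.3, 87.8, 92.5, 88.5, 59.2),
}
METRICS = ("AP", "AP50", "AP75", "APS", "APM", "APL", "FPS")


def gains(config, baseline=("CSPDarknet53", "PANet")):
    """Absolute change of each metric relative to the baseline configuration."""
    return {
        name: round(new - old, 1)
        for name, new, old in zip(METRICS, TABLE2[config], TABLE2[baseline])
    }


# Pick the configuration with the highest AP (hypothetical selection rule).
best = max(TABLE2, key=lambda cfg: TABLE2[cfg][0])
delta = gains(best)
print(best)
print(delta)
```

Because the metrics are already percentages, the differences are percentage points rather than relative changes. The same helper also exposes the speed cost of the most accurate configuration through the `FPS` entry.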

6.2 Comparison with existing state-of-the-art models:

In this section, the detection performance of WilDect-YOLO is compared with some of the existing state-of-the-art detection models (Zhao et al., 2019b). For the performance comparison, we consider Faster R-CNN (Ren et al., 2016), Mask R-CNN (He et al., 2017), RetinaNet (Lin et al., 2017b), SSD (Liu et al., 2016), YOLOv3 (Redmon and Farhadi, 2018), YOLOv4 (Bochkovskiy et al., 2020), and Dense-YOLOv4 (Roy and Bhaduri, 2022), all trained on the custom wildlife dataset in the OpenMMLab object detection toolbox (Chen et al., 2019).

Table 3: Comparison of different performance parameters including P, R, F1, mAP, and detection speed (in FPS) between WilDect-YOLO and other state-of-the-art models.

Model          P (%)   R (%)   F1-score (%)   mAP (%)   Det. time (ms)   FPS
Faster R-CNN   71.32   72.39   71.85          73.17     41.12            24.32
RetinaNet      75.11   77.67   76.36          77.11     32.89            30.40
SSD            76.13   80.19   78.10          80.52     28.22            35.43
Mask R-CNN     78.22   83.35   80.70          81.61     50.72            19.72
YOLOv3         83.61   87.47   85.49          86.61     25.11            39.82
YOLOv4         90.19   93.79   91.95          91.29     17.21            58.10
Dense-YOLOv4   93.53   96.42   94.95          93.61     16.77            59.63
WilDect-YOLO   97.18   98.56   97.87          96.89     16.89            59.20

A comparison of different performance parameters including P, R, F1-score, mAP, and detection speed obtained from these models is shown in Table 3. The comparison reveals

that the accuracy of Faster R-CNN, RetinaNet, SSD, and Mask R-CNN is considerably lower than that of the YOLO variants, as visually illustrated in the bar chart in Fig. 5. Between YOLOv3 and YOLOv4, YOLOv4 demonstrated better performance with a 6.46% increase in F1 and a 4.68% increase in mAP. We observe that the performance of Dense-YOLOv4 is superior to the original YOLOv4 with 3.34%, 2.63%, 3.01%, and 2.32% increases in P, R, F1, and mAP, respectively. However, WilDect-YOLO yields the best performance, reaching values of 97.18%, 98.56%, 97.87%, and 96.89% in P, R, F1, and mAP, respectively, as shown

in Fig. 5. Moreover, WilDect-YOLO provides a real-time detection speed of 59.21 FPS, which is about 1.9% higher than that of the original YOLOv4 model (58.10 FPS in Table 3). In summary, WilDect-YOLO outperforms some of the best detection models in terms of both detection accuracy and speed, making it suitable for automated high-performance wildlife detection.
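The detection time and detection speed columns of Table 3 are two views of the same measurement, related by FPS = 1000 / (time per image in ms). The sketch below, with hypothetical helper names `fps_from_latency` and `relative_change`, recomputes the frame rates of the three YOLOv4-based models from their detection times and the relative speed difference between WilDect-YOLO and YOLOv4. The recomputed rates agree with the tabulated ones only up to rounding in the last digit.

```python
def fps_from_latency(ms_per_image):
    """Frames per second implied by a per-image detection time in milliseconds."""
    return 1000.0 / ms_per_image


def relative_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return 100.0 * (new - old) / old


# Detection times (ms) transcribed from Table 3.
latency_ms = {"YOLOv4": 17.21, "Dense-YOLOv4": 16.77, "WilDect-YOLO": 16.89}

fps = {name: round(fps_from_latency(ms), 2) for name, ms in latency_ms.items()}
speedup = relative_change(fps["WilDect-YOLO"], fps["YOLOv4"])

for name, value in fps.items():
    print(f"{name}: {value:.2f} FPS")
print(f"WilDect-YOLO vs. YOLOv4: {speedup:.2f}% faster")
```

Note that `relative_change` returns a relative percentage, which is a different quantity from the percentage-point differences used when comparing P, R, F1, and mAP.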

6.3 Overall performance of WilDect-YOLO:

Figure 5: Comparison bar chart of different performance parameters including P, R, F1-score, mAP, and detection speed (in FPS) between WilDect-YOLO and other state-of-the-art models.

Table 4: Overall performance comparison between original YOLOv4, Dense-YOLOv4, and WilDect-YOLO.

Detection model   IoU     F1      mAP     Validation loss   Detection time (ms)   Detection speed (FPS)
YOLOv4            0.810   0.919   0.913   12.07             17.21                 58.10
Dense-YOLOv4      0.881   0.949   0.936   5.31              16.77                 59.63
WilDect-YOLO      0.917   0.979   0.969   1.88              16.89                 59.21

From the previous section, it has been observed that YOLOv4, Dense-YOLOv4, and WilDect-YOLO provide better performance compared to other state-of-the-art models. Therefore, these three models are closely compared in terms of mAP, F1, IoU, final loss, and average detection

time as shown in Table 4. The proposed WilDect-YOLO has achieved the highest average IoU value of 0.917, indicating superior BB accuracy during target detection compared to the other two models. Similarly, it has also shown better detection performance by achieving the highest F1 and mAP values of 97.9% and 96.9%, which are roughly 6% and 5.6% higher than those of the original YOLOv4 (0.919 and 0.913 in Table 4), respectively. Furthermore, the detection speed of 59.21 FPS obtained from WilDect-YOLO was found to be higher than that of YOLOv4 and slightly lower than that of Dense-YOLOv4. Thus, it can provide real-time detection of wildlife species with better accuracy compared to the other two models.

Figure 6: Comparison of (a) P-R curves; (b) loss evolution curves between original YOLOv4, Dense-YOLOv4, and WilDect-YOLO.

In addition, the P-R curves of the three models are compared in Fig. 6(a). From the comparison of the P-R curves, one can see that WilDect-YOLO attains a better P value for a particular R. It achieved the highest

area under the P-R curve, indicating superior detection performance compared to YOLOv4 and Dense-YOLOv4. Next, we compare the loss evolution curves as shown in Fig. 6(b). In the initial phase, after exhibiting several cycles of fluctuation, the loss of the WilDect-YOLO model tends to saturate after approximately 20,000 training steps with a final loss value of 1.88. In contrast, the other two models exhibit higher fluctuation in loss evolution and yield higher final loss values (12.07 for YOLOv4 and 5.31 for Dense-YOLOv4 in Table 4). Evidently, the proposed WilDect-YOLO is easier to train and converges faster, demonstrating its efficacy from the computational point of view.
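The area under a P-R curve can be computed in several ways, and the text does not state which one is used here. As a minimal sketch, the function below applies all-point interpolation, a common convention in object detection benchmarks; this choice and the function name `average_precision` are assumptions, and the sample points are hypothetical values rather than data from Fig. 6.

```python
def average_precision(recall, precision):
    """Area under the P-R curve with all-point interpolation.

    `recall` must be sorted in ascending order, and `precision[i]` is the
    precision measured at `recall[i]`.
    """
    r = [0.0] + list(recall) + [1.0]
    p = [0.0] + list(precision) + [0.0]
    # Replace each precision by the maximum precision at any higher recall,
    # which removes the zig-zag pattern of the raw curve.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


# Hypothetical P-R samples for two detectors (not taken from the paper).
detector_a = average_precision([0.5, 1.0], [1.0, 0.5])
detector_b = average_precision([0.5, 1.0], [1.0, 0.9])
print(detector_a, detector_b)
```

A curve that keeps a higher precision at every recall value, as described for WilDect-YOLO, necessarily encloses a larger area under this definition.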

To gain further insight into the performance of these models, the detection results containing TP, FP, and FN for each class and the corresponding P, R, and F1 values from Dense-YOLOv4 and WilDect-YOLO are shown in Table 5. WilDect-YOLO shows significant improvement in P and R values for various classes, in particular for the Galápagos Penguin, African Elephant, and Black Rhino classes. WilDect-YOLO efficiently maximizes the TP value while simultaneously reducing FP and FN values for all classes. Overall, the proposed model improves P by 3.65% and R by 2.14% compared to Dense-YOLOv4. From the overall comparison, we can conclude that WilDect-YOLO demonstrated the best performance in detecting various endangered wildlife species, outperforming both YOLOv4 and Dense-YOLOv4 in terms of precision and accuracy values.

Table 5: Comparison of detection results for individual classes between Dense-YOLOv4 and WilDect-YOLO.

Model          Class               Objects   TP     FP    FN    P (%)   R (%)   F1-score (%)
WilDect-YOLO   All                 10070     9694   281   141   97.18   98.56   97.87
WilDect-YOLO   Polar Bear          675       656    12    8     98.20   98.79   98.49
WilDect-YOLO   Galápagos Penguin   1453      1398   33    27    97.69   98.10   97.89
WilDect-YOLO   Giant Panda         1211      1201   23    11    98.12   99.09   98.60
WilDect-YOLO   Red Panda           789       756    12    9     98.43   98.82   98.63
WilDect-YOLO   African Elephant    1987      1878   89    32    95.47   98.32   96.87
WilDect-YOLO   Sunda Tiger         987       923    44    19    95.44   97.98   96.70
WilDect-YOLO   Black Rhino         1001      981    23    12    97.70   98.79   98.24
WilDect-YOLO   Wild Dog            1967      1901   45    23    97.68   98.80   98.24
Dense-YOLOv4   All                 10070     9291   642   345   93.53   96.42   94.95
Dense-YOLOv4   Polar Bear          675       621    37    22    94.37   96.58   95.46
Dense-YOLOv4   Galápagos Penguin   1453      1378   39    32    97.24   97.73   97.48
Dense-YOLOv4   Giant Panda         1211      1118   87    52    92.78   95.56   94.14
Dense-YOLOv4   Red Panda           789       740    29    26    96.22   96.60   96.41
Dense-YOLOv4   African Elephant    1987      1801   177   78    91.05   95.84   93.38
Dense-YOLOv4   Sunda Tiger         987       901    76    32    92.22   96.57   94.34
Dense-YOLOv4   Black Rhino         1001      921    87    36    91.36   96.23   93.74
Dense-YOLOv4   Wild Dog            1967      1811   110   67    94.27   96.43   95.34
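The P, R, and F1 values in Table 5 follow from the TP, FP, and FN counts through the standard definitions P = TP / (TP + FP), R = TP / (TP + FN), and F1 = 2PR / (P + R). The sketch below applies these definitions to the "All" rows of the table; the function name `detection_metrics` is a hypothetical helper. The recomputed values match the tabulated ones to within about 0.01, which is consistent with the table entries being truncated rather than rounded.

```python
def detection_metrics(tp, fp, fn):
    """Return (precision, recall, F1) in percent from TP, FP, and FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return 100 * precision, 100 * recall, 100 * f1


# (TP, FP, FN) transcribed from the "All" rows of Table 5.
counts = {
    "WilDect-YOLO": (9694, 281, 141),
    "Dense-YOLOv4": (9291, 642, 345),
}

for model, (tp, fp, fn) in counts.items():
    p, r, f1 = detection_metrics(tp, fp, fn)
    print(f"{model}: P={p:.2f}%  R={r:.2f}%  F1={f1:.2f}%")
```

The same function can be applied to any per-class row of the table, which makes it easy to see which classes contribute most to the difference between the two models.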

6.4 Detection of various animal species:

In this section, we present the detection results for eight different classes of endangered animal species from the proposed WilDect-YOLO and compare them with Dense-YOLOv4. The detection results are visualized with confined BBs under complex backgrounds and challenging environments as shown in Figs. 7-10. The corresponding detailed detection results, consisting of the number of detected and undetected targets with average confidence scores, are reported in Table 6. In Fig. 7, we tested the model for detecting Polar Bears and Galápagos Penguins in a challenging scenario where the target objects are set against a similarly textured background. The proposed model shows its efficacy by precisely detecting the target objects with high average confidence scores.

Figure 7: Detection results for Polar Bear (class-1) and Galápagos Penguin (class-2) from the proposed WilDect-YOLO and Dense-YOLOv4. Detailed detection results with average confidence scores have been shown in Table 6.

Table 6: Detailed detection results from WilDect-YOLO and Dense-YOLOv4 for different classes as shown in Figs. 7-10.

Species             Fig. No.          Model          Detected   Undetected   Avg. confidence score
Polar Bear          7(a)-(c)          WilDect-YOLO   10         0            0.96
Polar Bear          7(a-i)-(c-i)      Dense-YOLOv4   10         0            0.91
Galápagos Penguin   7(d)-(f)          WilDect-YOLO   16         0            0.93
Galápagos Penguin   7(d-i)-(f-i)      Dense-YOLOv4   13         3            0.88
Giant Panda         8(a)-(c)          WilDect-YOLO   18         1            0.94
Giant Panda         8(a-i)-(c-i)      Dense-YOLOv4   14         5            0.83
Red Panda           8(d)-(f)          WilDect-YOLO   7          0            0.98
Red Panda           8(d-i)-(f-i)      Dense-YOLOv4   7          0            0.93
African Elephant    9(a)-(c)          WilDect-YOLO   14         0            0.92
African Elephant    9(a-i)-(c-i)      Dense-YOLOv4   10         4            0.83
Sunda Tiger         9(d)-(f)          WilDect-YOLO   8          0            0.99
Sunda Tiger         9(d-i)-(f-i)      Dense-YOLOv4   7          1            0.92
Black Rhino         10(a)-(c)         WilDect-YOLO   7          0            0.98
Black Rhino         10(a-i)-(c-i)     Dense-YOLOv4   6          1            0.91
Wild Dog            10(d)-(f)         WilDect-YOLO   16         0            0.91
Wild Dog            10(d-i)-(f-i)     Dense-YOLOv4   11         5            0.77

Figure 8: Detection results for Giant Panda (class-3) and Red Panda (class-4) from the proposed WilDect-YOLO and Dense-YOLOv4. Detailed detection results with average confidence scores have been shown in Table 6.

Figure 9: Detection results for African Elephant (class-5) and Sunda Tiger (class-6) from the proposed WilDect-YOLO and Dense-YOLOv4. Detailed detection results with average confidence scores have been shown in Table 6.

In separate cases, we have considered detection for Giant Panda and Red Panda classes where multiple target objects have a significant degree of overlap between them. From the detection result, one can see that the bounding box prediction from

the proposed WilDect-YOLO is quite accurate in detecting each target object as illustrated in Fig. 8. In Fig. 9, we have extended the detection to African Elephant and Sunda Tiger classes where the targets are set against a complex and challenging background. Detection results from WilDect-YOLO in terms of bounding box precision are more accurate compared to Dense-YOLOv4 as shown in Table 6. To further illustrate the efficacy of WilDect-YOLO, we have considered the detection of Black Rhino and African Wild Dog cases that have a high degree of occlusion and dense overlap between objects, which makes detecting the target objects individually quite challenging. In such cases, the detection results from WilDect-YOLO demonstrate superior detection accuracy by precisely detecting each target with a high confidence score as shown in Fig. 10.

Figure 10: Detection results for Black Rhino (class-7) and African Wild Dog (class-8) from the proposed WilDect-YOLO and Dense-YOLOv4. Detailed detection results with average confidence scores have been shown in Table 6.

Additionally, for multiple target objects that are poorly visible due to insufficient lighting conditions, the proposed localization algorithm performs well without missed detections as demonstrated in Figs. 7-10. For high-aspect-ratio objects with irregular shapes and textures similar to their surroundings, the proposed model yields good performance in such challenging scenarios. The overall detection results illustrate more accurate and robust bounding box prediction from WilDect-YOLO for all target classes compared to Dense-YOLOv4.
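Bounding box accuracy of the kind compared in this section is usually quantified by the intersection over union (IoU) between a predicted box and its ground-truth box, the quantity whose average is reported in Table 4. The following is a minimal sketch of that computation for axis-aligned boxes. The corner format (x_min, y_min, x_max, y_max) and the function name `iou` are assumptions for illustration, and the example coordinates are hypothetical.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height are clamped at zero for non-overlapping boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical ground-truth and predicted boxes in pixel coordinates.
ground_truth = (50, 40, 210, 200)
prediction = (60, 50, 220, 210)
print(f"IoU = {iou(ground_truth, prediction):.3f}")
```

An IoU of 1 corresponds to a perfectly aligned box and 0 to no overlap, so a higher average IoU indicates tighter localization of the detected animals.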

7. Discussion:

The current study proposes an efficient automated detection framework for endangered wildlife species which can be deployed for animal surveys in various geographic regions without human intervention. Thus, it can significantly reduce the cost of operation and manual equipment and overcome the difficulties of working in adverse weather conditions. The current framework illustrates its capability of detecting various endangered animals which differ significantly in body texture, shape, size, color, and morphological characteristics. Furthermore, in the presence of various detection challenges such as visual similarities, complex backgrounds, a high degree of occlusion and overlap, and low contrast between a species and its surroundings, the proposed model can replace current state-of-the-art detection models in terms of accuracy and robustness. Additionally, the current deep learning framework can be extended to UAS imagery to further expand the capability of detecting various wild animals. With improved feature extraction capability and an efficient localization algorithm, the proposed model can be suitable for detecting small animals in relatively low-resolution images as well as satellite imagery. Although the present work focuses on endangered animal detection, the current framework can be extended to more generalized automated animal species detection for comprehensive and systematic wildlife surveys. Furthermore, the current work can be integrated with geographic information systems (GIS) for analyzing the migrations and activities of wild animals. Moreover, one potential application is to combine the object detection framework with segmentation methods such as Mask R-CNN (Bharati and Pramanik, 2020) and U-Net (Esser et al., 2018) to extract additional physical information such as diseases, body fat, and height as well as various animal activities including eating, running, and resting, which can be helpful in better understanding animal health and habits (Norouzzadeh et al., 2018). Overall, the current deep-learning model outperforms classical automated image analysis and various state-of-the-art approaches in wildlife detection, indicating potential for future improvements in performance and usability for precise and accurate endangered animal surveys, which can be applied to various automated wildlife monitoring tasks (Desgarnier et al., 2022; Hou et al., 2020; Chen et al., 2020; Mannocci et al., 2021; Arbieu et al., 2021) and different biological conservation purposes (Stern and Humphries, 2022). The current framework can also be extended to various fault detection/thermal imaging (Glowacz, 2021b,a,c) and human activity recognition (Xiao et al., 2021b) tasks.

8. Conclusions:

In summary, in the present work we have developed an efficient and robust computer vision-based object localization algorithm, WilDect-YOLO, for accurate classification and localization of various endangered wildlife species. In the proposed network, we integrate DenseNet blocks to better preserve critical feature information and two new residual blocks for efficient deep spatial feature extraction. In addition, SPP and improved PANet modules have been employed to efficiently preserve fine-grain localized information through feature fusion. Evaluated on a custom-made dataset of endangered wildlife species, WilDect-YOLO achieves mAP, F1-score, and precision values of 96.89%, 97.87%, and 97.18%, respectively, at a detection rate of 59.20 FPS, outperforming existing state-of-the-art wildlife detection models in terms of both classification accuracy and localized bounding box prediction in detecting various wildlife species. The current work effectively addresses the shortcomings of existing deep learning-based wildlife detection models and constitutes a step toward a fully automated and accurate wildlife monitoring system for real-time in-field applications.

Acknowledgements: The support of the Aeronautical Research and Development Board

(Grant No. DARO/08/1051450/M/I) is gratefully acknowledged.

Conflict of interest: The authors declare that they have no known competing financial

interests or personal relationships that could have appeared to influence the work reported in

this paper.

Data availability: The data that support the findings of this study are available upon

reasonable request.

References

Aebischer, T., Siguindo, G., Rochat, E., Arandjelovic, M., Heilman, A., Hickisch, R., Vigilant,

L., Joost, S., and Wegmann, D. (2017). First quantitative survey delineates the distribution

of chimpanzees in the eastern central african republic. Biological Conservation, 213:84--94.

AlexeyAB (2021). Pre-trained weights-file.

Arbieu, U., Helsper, K., Dadvar, M., Mueller, T., and Niamir, A. (2021). Natural language

processing as a tool to evaluate emotions in conservation conflicts. Biological Conservation,

256:109030.

Austrheim, G., Speed, J. D., Martinsen, V., Mulder, J., and Mysterud, A. (2014). Experimental

effects of herbivore density on aboveground plant biomass in an alpine grassland ecosystem.

Arctic, Antarctic, and Alpine Research, 46(3):535--541.

Barbedo, J. G. A., Koenigkan, L. V., Santos, T. T., and Santos, P. M. (2019). A study on the

detection of cattle in uav images using deep learning. Sensors, 19(24):5436.

Bharati, P. and Pramanik, A. (2020). Deep learning techniques—r-cnn to mask r-cnn: a survey.

Computational Intelligence in Pattern Recognition, pages 657--668.

Bochkovskiy, A., Wang, C.-Y., and Liao, H.-Y. M. (2020). Yolov4: Optimal speed and accuracy

of object detection.

Bose, R. and Roy, A. (2022). Accurate deep learning sub-grid scale models for large eddy

simulations. Bulletin of the American Physical Society.

Chabot, D., Stapleton, S., and Francis, C. M. (2019). Measuring the spectral signature of

polar bears from a drone to improve their detection from space. Biological Conservation,

237:125--132.

Chabot, D., Stapleton, S., and Francis, C. M. (2022). Using web images to train a deep neural

network to detect sparsely distributed wildlife in large volumes of remotely sensed imagery:

A case study of polar bears on sea ice. Ecological Informatics, page 101547.

Chalmers, C., Fergus, P., Curbelo Montanez, C. A., Longmore, S. N., and Wich, S. A.

(2021). Video analysis for the detection of animals using convolutional neural networks and

consumer-grade drones. Journal of Unmanned Vehicle Systems, 9(2):112--127.

Chandio, A., Gui, G., Kumar, T., Ullah, I., Ranjbarzadeh, R., Roy, A. M., Hussain, A., and

Shen, Y. (2022). Precise single-stage detector. arXiv preprint arXiv:2210.04252.

Chauvenet, A. L., Gill, R. M., Smith, G. C., Ward, A. I., and Massei, G. (2017). Quantifying

the bias in density estimated from distance sampling and camera trapping of unmarked

individuals. Ecological Modelling, 350:79--86.

Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J.,

Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J.,

Wang, J., Shi, J., Ouyang, W., Loy, C. C., and Lin, D. (2019). MMDetection: Open mmlab

detection toolbox and benchmark. arXiv preprint arXiv:1906.07155.

Chen, X., Zhao, J., Chen, Y.-h., Zhou, W., and Hughes, A. C. (2020). Automatic standardized

processing and identification of tropical bat calls using deep learning approaches. Biological

Conservation, 241:108269.

Cheng, G. and Han, J. (2016). A survey on object detection in optical remote sensing images.

ISPRS Journal of Photogrammetry and Remote Sensing, 117:11--28.

Choe, D.-G. and Kim, D.-K. (2020). Deep learning-based image data processing and archival

system for object detection of endangered species. Journal of information and communication

convergence engineering, 18(4):267--277.

Crooks, K., Burdett, C., Theobald, D., King, S., Marco, M. D., Rondinini, C., and Boitani, L.

(2017). Quantification of habitat fragmentation reveals extinction risk in terrestrial mammals.

Proceedings of the National Academy of Sciences, 114(29):7635--7640.

Davis, J. and Goadrich, M. (2006). The relationship between precision-recall and roc curves.

In Proceedings of the 23rd international conference on Machine learning, pages 233--240.

Delplanque, A., Foucher, S., Lejeune, P., Linchant, J., and Théau, J. (2021). Multispecies

detection and identification of african mammals in aerial imagery using convolutional neural

networks. Remote Sensing in Ecology and Conservation.

Desgarnier, L., Mouillot, D., Vigliola, L., Chaumont, M., and Mannocci, L. (2022). Putting eagle

rays on the map by coupling aerial video-surveys and deep learning. Biological Conservation,

267:109494.

Divya Meena, S. and Agilandeeswari, L. (2019). An efficient framework for animal breeds

classification using semi-supervised learning and multi-part convolutional neural network

(mp-cnn). IEEE Access, 7:151783--151802.

Duporge, I., Isupova, O., Reece, S., Macdonald, D. W., and Wang, T. (2021). Using

very-high-resolution satellite imagery and deep learning to detect and count african elephants

in heterogeneous landscapes. Remote sensing in ecology and conservation, 7(3):369--381.

Eikelboom, J. A., Wind, J., van de Ven, E., Kenana, L. M., Schroder, B., de Knegt, H. J., van

Langevelde, F., and Prins, H. H. (2019). Improving the precision and accuracy of animal

population estimates with aerial image object detection. Methods in Ecology and Evolution,

10(11):1875--1887.

Esser, P., Sutter, E., and Ommer, B. (2018). A variational u-net for conditional appearance and

shape generation. In Proceedings of the IEEE conference on computer vision and pattern

recognition, pages 8857--8866.

Feng, J. and Li, J. (2022). An adaptive embedding network with spatial constraints for the

use of few-shot learning in endangered-animal detection. ISPRS International Journal of

Geo-Information, 11(4):256.

Ferri, C., Hernández-Orallo, J., and Modroiu, R. (2009). An experimental comparison of

performance measures for classification. Pattern recognition letters, 30(1):27--38.

Ghiasi, G., Lin, T.-Y., and Le, Q. V. (2018). Dropblock: A regularization method for

convolutional networks. Advances in neural information processing systems, 31.

Girshick, R. (2015). Fast r-cnn. In Proceedings of the IEEE international conference on computer

vision, pages 1440--1448.

Glowacz, A. (2021a). Fault diagnosis of electric impact drills using thermal imaging.

Measurement, 171:108815.

Glowacz, A. (2021b). Thermographic fault diagnosis of ventilation in bldc motors. Sensors,

21(21):7245.

Glowacz, A. (2021c). Ventilation diagnosis of angle grinder using thermal imaging. Sensors,

21(8):2853.

Gonçalves, B. C., Spitzbart, B., and Lynch, H. J. (2020). Sealnet: A fully-automated pack-ice

seal detection pipeline for sub-meter satellite imagery. Remote Sensing of Environment,

239:111617.

Gonzalez, L. F., Montes, G. A., Puig, E., Johnson, S., Mengersen, K., and Gaston, K. J. (2016).

Unmanned aerial vehicles (uavs) and artificial intelligence revolutionizing wildlife monitoring

and conservation. Sensors, 16(1):97.

Guo, X., Shao, Q., Li, Y., Wang, Y., Wang, D., Liu, J., Fan, J., and Yang, F. (2018).

Application of uav remote sensing for a population census of large wild herbivores—taking

the headwater region of the yellow river as an example. Remote Sensing, 10(7):1041.

Han, J., Zhang, D., Cheng, G., Liu, N., and Xu, D. (2018). Advanced deep-learning techniques

for salient and category-specific object detection: a survey. IEEE Signal Processing Magazine,

35(1):84--100.

Harris, G., Thompson, R., Childs, J. L., and Sanderson, J. G. (2010). Automatic storage and

analysis of camera trap data. Bulletin of the Ecological Society of America, 91(3):352--360.

He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask r-cnn. In Proceedings of the

IEEE international conference on computer vision.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Spatial pyramid pooling in deep convolutional

networks for visual recognition. IEEE transactions on pattern analysis and machine

intelligence, 37(9):1904--1916.

He, Q., Zhao, Q., Liu, N., Chen, P., Zhang, Z., and Hou, R. (2019). Distinguishing individual

red pandas from their faces. In Lin, Z., Wang, L., Yang, J., Shi, G., Tan, T., Zheng, N.,

Chen, X., and Zhang, Y., editors, Pattern Recognition and Computer Vision, pages 714--724,

Cham. Springer International Publishing.

Hou, J., He, Y., Yang, H., Connor, T., Gao, J., Wang, Y., Zeng, Y., Zhang, J., Huang, J.,

Zheng, B., et al. (2020). Identification of animal individuals using deep learning: A case

study of giant panda. Biological Conservation, 242:108414.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected

convolutional networks. In Proceedings of the IEEE conference on computer vision and

pattern recognition, pages 4700--4708.

Ibraheam, M., Li, K. F., Gebali, F., and Sielecki, L. E. (2021). A performance comparison

and enhancement of animal species detection in images with various r-cnn models. AI,

2(4):552--577.

Jamil, S., Abbas, M. S., and Roy, A. M. (2022). Distinguishing malicious drones using vision

transformer. AI, 3(2):260--273.

Jaskólski, M. W. (2021). for human activity in arctic coastal environments--a review of selected

interactions and problems. Miscellanea Geographica, 25(2):127--143.

Kellenberger, B., Marcos, D., Lobry, S., and Tuia, D. (2019). Half a percent of labels is

enough: Efficient animal detection in uav imagery using deep cnns and active learning. IEEE

Transactions on Geoscience and Remote Sensing, 57(12):9524--9533.

Kellenberger, B., Marcos, D., and Tuia, D. (2018). Detecting mammals in uav images: Best

practices to address a substantially imbalanced dataset with deep learning. Remote sensing

of environment, 216:139--153.

Khaemba, W. M. and Stein, A. (2002). Improved sampling of wildlife populations using airborne

surveys. Wildlife research, 29(3):269--275.

Khan, W., Kumar, T., Cheng, Z., Raj, K., Roy, A. M., and Luo, B. (2022a). Sql and

nosql databases software architectures performance analysis and assessments--a systematic

literature review. arXiv preprint arXiv:2209.06977.

Khan, W., Raj, K., Kumar, T., Roy, A. M., and Luo, B. (2022b). Introducing urdu digits

dataset with demonstration of an efficient and robust noisy decoder-based pseudo example

generator. Symmetry, 14(10):1976.

Kim, J. S., Elli, G. V., and Bedny, M. (2019). Knowledge of animal appearance among sighted

and blind adults. Proceedings of the National Academy of Sciences, 116(23):11213--11222.

Kudo, H., Koshino, Y., Eto, A., Ichimura, M., and Kaeriyama, M. (2012). Cost-effective

accurate estimates of adult chum salmon, oncorhynchus keta, abundance in a japanese river

using a radio-controlled helicopter. Fisheries Research, 119:94--98.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. nature, 521(7553):436--444.

Lee, W. Y., Park, M., and Hyun, C.-U. (2019). Detection of two arctic birds in greenland and

an endangered bird in korea using rgb and thermal cameras with an unmanned aerial vehicle

(uav). PLOS ONE, 14(9):1--16.

Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017a). Focal loss for dense object

detection. In Proceedings of the IEEE international conference on computer vision, pages

2980--2988.

Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017b). Focal loss for dense object

detection. In Proceedings of the IEEE international conference on computer vision, pages

2980--2988.

Liu, S., Qi, L., Qin, H., Shi, J., and Jia, J. (2018). Path aggregation network for instance

segmentation. In Proceedings of the IEEE conference on computer vision and pattern

recognition, pages 8759--8768.

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C., and Berg, A. (2016). Ssd:

Single shot multibox detector. In European conference on computer vision (ECCV).

Loshchilov, I. and Hutter, F. (2017). Sgdr: Stochastic gradient descent with warm restarts.

Mannocci, L., Baidai, Y., Forget, F., Tolotti, M. T., Dagorn, L., and Capello, M. (2021).

Machine learning to detect bycatch risk: Novel application to echosounder buoys data in

tuna purse seine fisheries. Biological Conservation, 255:109004.

Meena, S. D. and Loganathan, A. (2020). Intelligent animal detection system using sparse multi

discriminative-neural network (smd-nn) to mitigate animal-vehicle collision. Environmental

Science and Pollution Research, 27:39619--39634.

Misra, D. (2020). Mish: A self regularized non-monotonic activation function.

Moreni, M., Theau, J., and Foucher, S. (2021). Train fast while reducing false positives:

Improving animal classification performance using convolutional neural networks. Geomatics,

1(1):34--49.

Naude, J. and Joubert, D. (2019). The aerial elephant dataset: A new public benchmark for

aerial object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision

and Pattern Recognition Workshops, pages 48--55.

Norouzzadeh, M. S., Nguyen, A., Kosmala, M., Swanson, A., Palmer, M. S., Packer, C.,

and Clune, J. (2018). Automatically identifying, counting, and describing wild animals in

camera-trap images with deep learning. Proceedings of the National Academy of Sciences,

115(25):E5716--E5725.

O’Brien, T. (2010). Wildlife picture index and biodiversity monitoring: issues and future

directions. Animal Conservation, 13(4):350--352.

Ofli, F., Meier, P., Imran, M., Castillo, C., Tuia, D., Rey, N., Briant, J., Millet, P., Reinhard,

F., Parkan, M., et al. (2016). Combining human computing and machine learning to make

sense of big (aerial) data for disaster response. Big data, 4(1):47--59.

Parham, J., Stewart, C., Crall, J., Rubenstein, D., Holmberg, J., and Berger-Wolf, T. (2018). An

animal detection pipeline for identification. In 2018 IEEE Winter Conference on Applications

of Computer Vision (WACV), pages 1075--1083. IEEE.

Peng, J., Wang, D., Liao, X., Shao, Q., Sun, Z., Yue, H., and Ye, H. (2020). Wild animal

survey using uas imagery and deep learning: modified faster r-cnn for kiang detection in

tibetan plateau. ISPRS Journal of Photogrammetry and Remote Sensing, 169:364--376.

Petso, T., Jamisola, R. S., Mpoeleng, D., and Mmereki, W. (2021). Individual animal and herd

identification using custom yolo v3 and v4 with images taken from a uav camera at different

altitudes. In 2021 IEEE 6th International Conference on Signal and Image Processing

(ICSIP), pages 33--39. IEEE.

Pringle, R. M., Syfert, M., Webb, J. K., and Shine, R. (2009). Quantifying historical changes

in habitat availability for endangered species: use of pixel-and object-based remote sensing.

Journal of Applied Ecology, 46(3):544--553.

Rawat, W. and Wang, Z. (2017). Deep convolutional neural networks for image classification:

A comprehensive review. Neural computation, 29(9):2352--2449.

Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You only look once: Unified,

real-time object detection. In Proceedings of the IEEE conference on computer vision and

pattern recognition, pages 779--788.

Redmon, J. and Farhadi, A. (2017). Yolo9000: better, faster, stronger. In Proceedings of the

IEEE conference on computer vision and pattern recognition, pages 7263--7271.

Redmon, J. and Farhadi, A. (2018). Yolov3: An incremental improvement.

Ren, S., He, K., Girshick, R., and Sun, J. (2016). Faster r-cnn: towards real-time object

detection with region proposal networks. IEEE transactions on pattern analysis and machine

intelligence, 39(6):1137--1149.

Rey, N., Volpi, M., Joost, S., and Tuia, D. (2017). Detecting animals in African savanna with

UAVs and the crowds. Remote Sensing of Environment, 200:341--351.

Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., and Savarese, S. (2019). Generalized

intersection over union: A metric and a loss for bounding box regression. In Proceedings of

the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 658--666.

Roy, A. M. (2021). Finite element framework for efficient design of three dimensional

multicomponent composite helicopter rotor blade system. Eng, 2(1):69--79.

Roy, A. M. (2022a). Adaptive transfer learning-based multiscale feature fused deep convolutional

neural network for EEG MI multiclassification in brain--computer interface. Engineering

Applications of Artificial Intelligence, 116:105347.

Roy, A. M. (2022b). An efficient multi-scale CNN model with intrinsic feature integration for

motor imagery EEG subject classification in brain-machine interfaces. Biomedical Signal

Processing and Control, 74:103496.

Roy, A. M. (2022c). A multi-scale fusion CNN model based on adaptive transfer learning for

multi-class MI-classification in BCI system. bioRxiv.

Roy, A. M. and Bhaduri, J. (2021). A deep learning enabled multi-class plant disease detection

model based on computer vision. AI, 2(3):413--428.

Roy, A. M. and Bhaduri, J. (2022). Real-time growth stage detection model for high degree

of occultation using DenseNet-fused YOLOv4. Computers and Electronics in Agriculture,

193:106694.

Roy, A. M., Bose, R., and Bhaduri, J. (2022). A fast accurate fine-grain object detection model

based on YOLOv4 deep neural network. Neural Computing and Applications, pages 1--27.

Ruff, Z. J., Lesmeister, D. B., Appel, C. L., and Sullivan, C. M. (2021). Workflow and

convolutional neural network for automated identification of animal sounds. Ecological

Indicators, 124:107419.

Saxena, A., Gupta, D. K., and Singh, S. (2021). An animal detection and collision avoidance

system using deep learning. In Advances in Communication and Computational Technology,

pages 1069--1084. Springer.

Schindler, F. and Steinhage, V. (2021). Identification of animals and recognition of their actions

in wildlife videos using deep learning techniques. Ecological Informatics, 61:101215.

Singh, A., Pietrasik, M., Natha, G., Ghouaiel, N., Brizel, K., and Ray, N. (2020). Animal

detection in man-made environments. In 2020 IEEE Winter Conference on Applications of

Computer Vision (WACV), pages 1427--1438.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014).

Dropout: a simple way to prevent neural networks from overfitting. The journal of machine

learning research, 15(1):1929--1958.

Stern, E. R. and Humphries, M. M. (2022). Interweaving local, expert, and indigenous knowledge

into quantitative wildlife analyses: A systematic review. Biological Conservation, 266:109444.

Taheri, S. and Toygar, Ö. (2018). Animal classification using facial images with score-level

fusion. IET Computer Vision, 12:679--685.

Torney, C. J., Lloyd-Jones, D. J., Chevallier, M., Moyer, D. C., Maliti, H. T., Mwita, M.,

Kohi, E. M., and Hopcraft, G. C. (2019). A comparison of deep learning and citizen science

techniques for counting wildlife in aerial survey images. Methods in Ecology and Evolution,

10(6):779--787.

Tzutalin (2015). Labelimg.

Voulodimos, A., Doulamis, N., Doulamis, A., and Protopapadakis, E. (2018). Deep learning for

computer vision: A brief review. Computational intelligence and neuroscience, 2018.

Wang, D., Shao, Q., and Yue, H. (2019). Surveying wild animals from satellites, manned

aircraft and unmanned aerial systems (UASs): A review. Remote Sensing, 11(11):1308.

Xiao, Z., Xu, X., Xing, H., Luo, S., Dai, P., and Zhan, D. (2021a). RTFN: A robust temporal

feature network for time series classification. Information Sciences, 571:65--86.

Xiao, Z., Xu, X., Xing, H., Song, F., Wang, X., and Zhao, B. (2021b). A federated learning

system with enhanced feature extraction for human activity recognition. Knowledge-Based

Systems, 229:107338.

Xiao, Z., Xu, X., Zhang, H., and Szczerbicki, E. (2021c). A new multi-process collaborative

architecture for time series classification. Knowledge-Based Systems, 220:106934.

Xing, H., Xiao, Z., Qu, R., Zhu, Z., and Zhao, B. (2022a). An efficient federated distillation

learning system for multitask time series classification. IEEE Transactions on Instrumentation

and Measurement, 71:1--12.

Xing, H., Xiao, Z., Zhan, D., Luo, S., Dai, P., and Li, K. (2022b). SelfMatch: Robust

semisupervised time-series classification with self-distillation. International Journal of

Intelligent Systems.

Yao, Z., Cao, Y., Zheng, S., Huang, G., and Lin, S. (2021). Cross-iteration batch normalization.

Zhao, Z.-Q., Zheng, P., Xu, S.-t., and Wu, X. (2019a). Object detection with deep learning: A

review. IEEE transactions on neural networks and learning systems, 30(11):3212--3232.

Zhao, Z.-Q., Zheng, P., Xu, S.-t., and Wu, X. (2019b). Object detection with deep learning: A

review. IEEE transactions on neural networks and learning systems, 30(11):3212--3232.

Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., and Ren, D. (2020). Distance-IoU loss: Faster

and better learning for bounding box regression. In Proceedings of the AAAI Conference on

Artificial Intelligence, volume 34, pages 12993--13000.

Zhu, X. X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., and Fraundorfer, F. (2017). Deep

learning in remote sensing: A comprehensive review and list of resources. IEEE Geoscience

and Remote Sensing Magazine, 5(4):8--36.

Zotin, A. G. and Proskurin, A. V. (2019). Animal detection using a series of images under

complex shooting conditions. The International Archives of the Photogrammetry, Remote

Sensing and Spatial Information Sciences, XLII-2/W12:249--257.

Efficient Paddy Grain Quality Assessment Approach Utilizing Affordable Sensors

Article

Full-text available

May 2024

Paddy (Oryza sativa) is one of the most consumed food grains in the world. The process from its sowing to consumption via harvesting, processing, storage and management require much effort and expertise. The grain quality of the product is heavily affected by the weather conditions, irrigation frequency, and many other factors. However, quality control is of immense importance, and thus, the evaluation of grain quality is necessary. Since it is necessary and arduous, we try to overcome the limitations and shortcomings of grain quality evaluation using image processing and machine learning (ML) techniques. Most existing methods are designed for rice grain quality assessment, noting that the key characteristics of paddy and rice are different. In addition, they have complex and expensive setups and utilize black-box ML models. To handle these issues, in this paper, we propose a reliable ML-based IoT paddy grain quality assessment system utilizing affordable sensors. It involves a specific data collection procedure followed by image processing with an ML-based model to predict the quality. Different explainable features are used for classifying the grain quality of paddy grain, like the shape, size, moisture, and maturity of the grain. The precision of the system was tested in real-world scenarios. To our knowledge, it is the first automated system to precisely provide an overall quality metric. The main feature of our system is its explainability in terms of utilized features and fuzzy rules, which increases the confidence and trustworthiness of the public toward its use. The grain variety used for experiments majorly belonged to the Indian Subcontinent, but it covered a significant variation in the shape and size of the grain.

Efficient Deep Learning-based Semantic Mapping Approach using Monocular Vision for Resource-Limited Mobile Robots

Article

Full-text available

Nov 2023
J INTELL ROBOT SYST

In recent years, the demand for robots is not only limited to sophisticated industrial setups, there exists an unprecedented demand for low-cost robots in living places with the capabilities of performing human-centric operations. For the semantic-rich mapping of random environments, current state-of-the-art techniques include sophisticated hardware like Kinect sensor, Lidar, deep learning (DL)-based vision, and stereo vision-based systems. Inevitably, these systems increase the cost of the product which requires expensive hardware for processing the information. It, therefore, creates a hurdle to implementing them on low-cost service robots where interaction matters more than precision. To overcome these issues, in this paper, we propose two novel techniques: 1) a light, yet efficient, semantic mapping technique for scene-wise localization of objects by combining object detection and camera geometry; 2) an accurate and robust novel integration technique for coalition of scene-wise information for large-scale maps. The main goal of this framework is to host a semantic mapping process on a limited processing device like Raspberry Pi. The semantic information can be further integrated into any Human-Robot Interaction (HRI) system. A tensorflow-lite version of Single Shot Detection (SSD) for object detection, a wheel odometer for odometry tracking, and pinhole camera geometry are used for the whole mapping process. The proposed model has demonstrated promising results by accurately mapping the environment with semantic-rich features. Current work is time efficient and suitable for object-orientated task execution of low-cost robots, such as smart toys and other smart home gadgets.

Understanding EEG signals for subject-wise Definition of Armoni Activities

Preprint

Oct 2023

In a growing world of technology, psychological disorders became a challenge to be solved. The methods used for cognitive stimulation are very conventional and based on one-way communication, which only rely on the material or method used for training of an individual. It doesn’t use any kind of feedback from the individual to analyze the progress of the training process. We have proposed a closed-loop methodology to improve the cognitive state of a person with ID (Intellectual disability). We have used a platform named ‘Armoni’, for providing training to the intellectually disabled individuals. The learning is performed in a closed-loop by using feedback in the form of change in affective state. For feedback to the Armoni, an EEG (Electroencephalograph) headband is used. All the changes in EEG are observed and classified against the change in the mean and standard deviation value of all frequency bands of signal. This comparison is being helpful in defining every activity with respect to change in brain signals. In this paper, we have discussed the process of treatment of EEG signal and its definition against the different activities of Armoni. We have tested it on 6 different systems with different age groups and cognitive levels.

Efficient Paddy Grains Quality Assessment Approach Utilizing Affordable Sensors

Preprint

Sep 2023

In the realm of computer vision, paddy (Oryza Sativa) plays a pivotal role as a globally consumed staple crop. Its cultivation, harvesting, processing, and storage involve intricate quality control. Numerous factors, including weather conditions and irrigation frequency, influence grain quality. To address this, we present an innovative approach that combines image processing and machine learning (ML). Existing methods for rice grain quality assessment, while valuable, are tailored to rice-specific characteristics, employing complex and costly setups and opaque ML models. Our research overcomes these limitations with a robust ML-based IoT system for paddy grain quality assessment, using affordable sensors, a comprehensive data collection process, and an ML-driven image processing model. Importantly, our approach utilizes interpretable features like Shape, Size, Moisture, and Maturity for paddy grain classification. Rigorous real-world testing confirms its precision, marking it as the first automated system capable of providing a reliable overall quality metric. Our system’s unique feature lies in its transparency, with clear features and fuzzy rules, inspiring confidence and trust. While our experiments primarily feature Indian Subcontinent grain varieties, the system’s adaptability to diverse paddy types is evident, contributing significantly to computer vision.

AudRandAug: Random Image Augmentations for Audio Classification

Preprint

Sep 2023

Data augmentation has proven to be effective in training neural networks. Recently, a method called RandAug was proposed, randomly selecting data augmentation techniques from a predefined search space. RandAug has demonstrated significant performance improvements for image-related tasks while imposing minimal computational overhead. However, no prior research has explored the application of RandAug specifically for audio data augmentation, which converts audio into an image-like pattern. To address this gap, we introduce AudRandAug, an adaptation of RandAug for audio data. AudRandAug selects data augmentation policies from a dedicated audio search space. To evaluate the effectiveness of AudRandAug, we conducted experiments using various models and datasets. Our findings indicate that AudRandAug outperforms other existing data augmentation methods regarding accuracy performance.

Go Together: Bridging the Gap between Learners and Teachers

Preprint

Full-text available

Jul 2023

After the pandemic, humanity has been facing different types of challenges. Social relationships, societal values, and academic and professional behavior have been hit the most. People are shifting their routines to social media and gadgets, and getting addicted to their isolation. This sudden change in their lives has caused an unusual social breakdown and endangered their mental health. In mid-2021, Pakistan's first Human Library was established under HelpingMind to overcome these effects. Despite online sessions and webinars, HelpingMind needs technology to reach the masses. In this work, we customized the UI or UX of a Go Together Mobile Application (GTMA) to meet the requirements of the client organization. A very interesting concept of the book (expert listener or psychologist) and the reader is introduced in GTMA. It offers separate dashboards, separate reviews or rating systems, booking, and venue information to engage the human reader with his or her favorite human book. The loyalty program enables the members to avail discounts through a mobile application and its membership is global where both the human-reader and human-books can register under the platform. The minimum viable product has been approved by our client organization.

DenseSPH-YOLOv5: An automated damage detection model based on DenseNet and Swin-Transformer prediction head-enabled YOLOv5 with attention mechanism -50 days' free access https://authors.elsevier.com/a/1h6JJ5FA1k5pC6

Article

Apr 2023
ADV ENG INFORM

A Computer Vision Enabled damage detection model with improved YOLOv5 based on Transformer Prediction Head

Preprint

Full-text available

Oct 2022

Objective:Computer vision-based up-to-date accurate damage classification and localization are of decisive importance for infrastructure monitoring, safety, and the serviceability of civil infrastructure. Current state-of-the-art deep learning (DL)-based damage detection models, however, often lack superior feature extraction capability in complex and noisy environments, limiting the development of accurate and reliable object distinction. Method: To this end, we present DenseSPH-YOLOv5, a real-time DL-based high-performance damage detection model where DenseNet blocks have been integrated with the backbone to improve in preserving and reusing critical feature information. Additionally, convolutional block attention modules (CBAM) have been implemented to improve attention performance mechanisms for strong and discriminating deep spatial feature extraction that results in superior detection under various challenging environments. Moreover, additional feature fusion layers and a Swin-Transformer Prediction Head (SPH) have been added leveraging advanced self-attention mechanism for more efficient detection of multiscale object sizes and simultaneously reducing the computational complexity. Results: Evaluating the model performance in large-scale Road Damage Dataset (RDD-2018), at a detection rate of 62.4 FPS, DenseSPH-YOLOv5 obtains a mean average precision (mAP) value of 85.25 %, F1-score of 81.18 %, and precision (P) value of 89.51 % outperforming current state-of-the-art models. Significance: The present research provides an effective and efficient damage localization model addressing the shortcoming of existing DL-based damage detection models by providing highly accurate localized bounding box prediction. Current work constitutes a step towards an accurate and robust automated damage detection system in real-time in-field applications.

Deep learning-accelerated computational framework based on Physics Informed Neural Network for the solution of linear elasticity

Article

Mar 2023
NEURAL NETWORKS

The paper presents an efficient and robust data-driven deep learning (DL) computational framework developed for linear continuum elasticity problems. The methodology is based on the fundamentals of the Physics Informed Neural Networks (PINNs). For an accurate representation of the field variables, a multi-objective loss function is proposed. It consists of terms corresponding to the residual of the governing partial differential equations (PDE), constitutive relations derived from the governing physics, various boundary conditions, and data-driven physical knowledge fitting terms across randomly selected collocation points in the problem domain. To this end, multiple densely connected independent artificial neural networks (ANNs), each approximating a field variable, are trained to obtain accurate solutions. Several benchmark problems including the Airy solution to elasticity and the Kirchhoff-Love plate problem are solved. Performance in terms of accuracy and robustness illustrates the superiority of the current framework showing excellent agreement with analytical solutions. The present work combines the benefits of the classical methods depending on the physical information available in analytical relations with the superior capabilities of the DL techniques in the data-driven construction of lightweight, yet accurate and robust neural networks. The models developed herein can significantly boost computational speed using minimal network parameters with easy adaptability in different computational platforms.

An efficient and robust Phonocardiography (PCG)-based Valvular Heart Diseases (VHD) detection framework using Vision Transformer (ViT) ( 50 days' free access https://authors.elsevier.com/a/1gpnJ2OYd3rI9 )

Article

Mar 2023
COMPUT BIOL MED

Background and objectives: Valvular heart diseases (VHDs) are one of the dominant causes of cardiovascular abnormalities that have been associated with high mortality rates globally. Rapid and accurate diagnosis of the early stage of VHD based on cardiac phonocardiogram (PCG) signal is critical that allows for optimum medication and reduction of mortality rate. Methods: To this end, the current study proposes novel deep learning (DL)-based high-performance VHD detection frameworks that are relatively simpler in terms of network structures, yet effective for accurately detecting multiple VHDs. We present three different frameworks considering both 1D and 2D PCG raw signals. For 1D PCG, Mel frequency cepstral coefficients (MFCC) and linear prediction cepstral coefficients (LPCC) features, whereas, for 2D PCG, various D-CNN features are extracted. Additionally, nature/bio-inspired algorithms (NIA/BIA) including particle swarm optimization (PSO) and genetic algorithm (GA) have been utilized for automatic and efficient feature selection directly from the raw PCG signal. To further improve the performance of the classifier, vision transformer (ViT) has been implemented levering the self-attention mechanism on the time frequency representation (TFR) of 2D PCG signal. Our extensive study presents a comparative performance analysis and the scope of enhancement for the combination of different descriptors, classifiers, and feature selection algorithms. Main Results: Among all classifiers, ViT provides the best performance by achieving mean average accuracy Acc of 99.90 % and F1-score of 99.95 % outperforming current state-of-the-art VHD classification models. Conclusions: The present research provides a robust and efficient DL-based end-to-end PCG signal classification framework for designing a automated high-performance VHD diagnosis system.

Systematic Review SQL and NoSQL Database Software Architecture Performance Analysis and Assessments-A Systematic Literature Review

Article

Full-text available

May 2023

The competent software architecture plays a crucial role in the difficult task of big data processing for SQL and NoSQL databases. SQL databases were created to organize data and allow for horizontal expansion. NoSQL databases, on the other hand, support horizontal scalability and can efficiently process large amounts of unstructured data. Organizational needs determine which paradigm is appropriate, yet selecting the best option is not always easy. Differences in database design are what set SQL and NoSQL databases apart. Each NoSQL database type also consistently employs a mixed-model approach. Therefore, it is challenging for cloud users to transfer their data among different cloud storage services (CSPs). There are several different paradigms being monitored by the various cloud platforms (IaaS, PaaS, SaaS, and DBaaS). The purpose of this SLR is to examine the articles that address cloud data portability and interoperability, as well as the software architectures of SQL and NoSQL databases. Numerous studies comparing the capabilities of SQL and NoSQL of databases, particularly Oracle RDBMS and NoSQL Document Database (MongoDB), in terms of scale, performance, availability, consistency, and sharding, were presented as part of the state of the art. Research indicates that NoSQL databases, with their specifically tailored structures, may be the best option for big data analytics, while SQL databases are best suited for online transaction processing (OLTP) purposes.

Introducing Urdu Digits Dataset with Demonstration of an Efficient and Robust Noisy Decoder-Based Pseudo Example Generator

Article

Full-text available

Sep 2022

In the present work, we propose a novel method utilizing only a decoder for generation of pseudo-examples, which has shown great success in image classification tasks. The proposed method is particularly constructive when the data are in a limited quantity used for semi-supervised learning (SSL) or few-shot learning (FSL). While most of the previous works have used an autoencoder to improve the classification performance for SSL, using a single autoencoder may generate confusing pseudo-examples that could degrade the classifier’s performance. On the other hand, various models that utilize encoder– decoder architecture for sample generation can significantly increase computational overhead. To address the issues mentioned above, we propose an efficient means of generating pseudo-examples by using only the generator (decoder) network separately for each class that has shown to be effective for both SSL and FSL. In our approach, the decoder is trained for each class sample using random noise, and multiple samples are generated using the trained decoder. Our generator-based approach outperforms previous state-of-the-art SSL and FSL approaches. In addition, we released the Urdu digits dataset consisting of 10,000 images, including 8000 training and 2000 test images collected through three different methods for purposes of diversity. Furthermore, we explored the effectiveness of our proposed method on the Urdu digits dataset by using both SSL and FSL, which demonstrated improvement of 3.04% and 1.50% in terms of average accuracy, respectively, illustrating the superiority of the proposed method compared to the current state-of-the-art models.

Adaptive transfer learning-based multiscale feature fused deep convolutional neural network for EEG MI multiclassification in brain–computer interface

Article

Full-text available

Nov 2022
ENG APPL ARTIF INTEL

Arunabha M Roy

Objective Deep learning (DL)-based brain–computer interface (BCI) in motor imagery (MI) has emerged as a powerful method for establishing direct communication between the brain and external electronic devices. However, due to inter-subject variability, inherent complex properties, and low signal-to-noise ratio (SNR) in electroencephalogram (EEG) signals are major challenges that significantly hinder the accuracy of the MI classifier. Approach To overcome this, the present work proposes an efficient transfer learning (TL)-based multi-scale feature fused CNN (MSFFCNN) which can capture the distinguishable features of various non-overlapping canonical frequency bands of EEG signals from different convolutional scales for multi-class MI classification. Significance In order to account for inter-subject variability from different subjects, the current work presents 4 different model variants including subject-independent and subject-adaptive classification models considering different adaptation configurations to exploit the full learning capacity of the classifier. Each adaptation configuration has been fine-tuned in an extensively trained pre-trained model and the performance of the classifier has been studied for a vast range of learning rates and degrees of adaptation which illustrates the advantages of using an adaptive transfer learning-based model. Results The model achieves an average classification accuracy of 94.06% (±0.70%) and the kappa value of 0.88 outperforming several baseline and current state-of-the-art EEG-based MI classification models with fewer training samples. The present research provides an effective and efficient transfer learning-based end-to-end MI classification framework for designing a high-performance robust MI-BCI system.

An Efficient Federated Distillation Learning System for Multi-task Time Series Classification

Article

Full-text available

Aug 2022

This paper proposes an efficient federated distillation learning system (EFDLS) for multi-task time series classification (TSC). EFDLS consists of a central server and multiple mobile users, where different users may run different TSC tasks. EFDLS has two novel components: a feature-based student-teacher (FBST) framework and a distance-based weights matching (DBWM) scheme. For each user, the FBST framework transfers knowledge from its teacher’s hidden layers to its student’s hidden layers via knowledge distillation, where the teacher and student have identical network structures. For each connected user, its student model’s hidden layers’ weights are uploaded to the EFDLS server periodically. The DBWM scheme is deployed on the server, with the least square distance used to measure the similarity between the weights of two given models. This scheme finds a partner for each connected user such that the user’s and its partner’s weights are the closest among all the weights uploaded. The server exchanges and sends back the user’s and its partner’s weights to these two users which then load the received weights to their teachers’ hidden layers. Experimental results show that compared with a number of state-of-the-art federated learning algorithms, our proposed EFDLS wins 20 out of 44 standard UCR2018 datasets and achieves the highest mean accuracy (70.14%) on these datasets. In particular, compared with a single-task Baseline, EFDLS obtains 32/4/8 regarding ’win’/’tie’/’lose’ and results in an improvement of approximately 4% in terms of mean accuracy.

SelfMatch: Robust semisupervised time-series classification with self-distillation

Article

Full-text available

Jul 2022
INT J INTELL SYST

Over the years, a number of semisupervised deep-learning algorithms have been proposed for time-series classification (TSC). In semisupervised deep learning, from the point of view of representation hierarchy, semantic information extracted from lower levels is the basis of that extracted from higher levels. The authors wonder if high-level semantic information extracted is also helpful for capturing low-level semantic information. This paper studies this problem and proposes a robust semisupervised model with self-distillation (SD) that simplifies existing semisupervised learning (SSL) techniques for TSC, called SelfMatch. SelfMatch hybridizes supervised learning, unsupervised learning, and SD. In unsupervised learning, SelfMatch applies pseudolabeling to feature extraction on labeled data. A weakly augmented sequence is used as a target to guide the prediction of a Timecut-augmented version of the same sequence. SD promotes the knowledge flow from higher to lower levels, guiding the extraction of low-level semantic information. This paper designs a feature extractor for TSC, called ResNet–LSTMaN, responsible for feature and relation extraction. The experimental results show that SelfMatch achieves excellent SSL performance on 35 widely adopted UCR2018 data sets, compared with a number of state-of-the-art semisupervised and supervised algorithms.

An Adaptive Embedding Network with Spatial Constraints for the Use of Few-Shot Learning in Endangered-Animal Detection

Article

Full-text available

Apr 2022
ISPRS

Image recording is now ubiquitous in the fields of endangered-animal conservation and GIS. However, endangered animals are rarely seen, and, thus, only a few samples of images of them are available. In particular, the study of endangered-animal detection has a vital spatial component. We propose an adaptive, few-shot learning approach to endangered-animal detection through data augmentation by applying constraints on the mixture of foreground and background images based on species distributions. First, the pre-trained, salient network U2-Net segments the foregrounds and backgrounds of images of endangered animals. Then, the pre-trained image completion network CR-Fill is used to repair the incomplete environment. Furthermore, our approach identifies a foreground–background mixture of different images to produce multiple new image examples, using the relation network to permit a more realistic mixture of foreground and background images. It does not require further supervision, and it is easy to embed into existing networks, which learn to compensate for the uncertainties and nonstationarities of few-shot learning. Our experimental results are in excellent agreement with theoretical predictions by different evaluation metrics, and they unveil the future potential of video surveillance to address endangered-animal detection in studies of their behavior and conservation.

Distinguishing Malicious Drones Using Vision Transformer

Article

Full-text available

Mar 2022

Drones are commonly used in numerous applications, such as surveillance, navigation, spraying pesticides in autonomous agricultural systems, various military services, etc., due to their variable sizes and workloads. However, malicious drones that carry harmful objects are often adversely used to intrude restricted areas and attack critical public places. Thus, the timely detection of malicious drones can prevent potential harm. This article proposes a vision transformer (ViT) based framework to distinguish between drones and malicious drones. In the proposed ViT based model, drone images are split into fixed-size patches; then, linearly embeddings and position embeddings are applied, and the resulting sequence of vectors is finally fed to a standard ViT encoder. During classification, an additional learnable classification token associated to the sequence is used. The proposed framework is compared with several handcrafted and deep convolutional neural networks (D-CNN), which reveal that the proposed model has achieved an accuracy of 98.3%, outperforming various handcrafted and D-CNNs models. Additionally, the superiority of the proposed model is illustrated by comparing it with the existing state-of-the-art drone-detection methods.

Putting eagle rays on the map by coupling aerial video-surveys and deep learning

Article

Mar 2022
BIOL CONSERV

Reliable and efficient techniques are urgently needed to monitor elasmobranch populations that face increasing threats worldwide. Aerial video-surveys provide precise and verifiable observations for the rapid assessment of species distribution and abundance in coral reefs, but the manual processing of videos is a major bottleneck for timely conservation applications. In this study, we applied deep learning for the automated detection and mapping of vulnerable eagle rays from aerial videos. A light aircraft dedicated to touristic flights allowed us to collect 42 h of aerial video footage over a shallow coral lagoon in New Caledonia (Southwest Pacific). We extracted the videos at a rate of one image per second before annotating them, yielding 314 images with eagle rays. We then trained a convolutional neural network with 80% of the eagle ray images and evaluated its accuracy on the remaining 20% (independent data sets). Our deep learning model detected 92% of the annotated eagle rays in a diversity of habitats and acquisition conditions across the studied coral lagoon. Our study offers a potential breakthrough for the monitoring of ray populations in coral reef ecosystems by providing a fast and accurate alternative to the manual processing of aerial videos. Our deep learning approach can be extended to the detection of other elasmobranchs and applied to systematic aerial surveys to not only detect individuals but also estimate species density in coral reef habitats.
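The 80%/20% partition mentioned in this abstract can be sketched as a seeded random split. The abstract does not say how the images were assigned to each subset, so the per-image random shuffle below is an assumption, and the identifiers (`split_train_test`, the `frame_XXXX` names) are hypothetical placeholders for the 314 annotated images.

```python
import random


def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle items reproducibly and split them into train and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder identifiers standing in for the 314 images containing eagle rays.
image_ids = [f"frame_{i:04d}" for i in range(314)]
train_ids, test_ids = split_train_test(image_ids)
```

When frames come from continuous video, consecutive frames can be near-duplicates, so splitting by flight or by video segment instead of by frame is one way to keep the test set independent.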

A Deep Learning Enabled Multi-Class Plant Disease Detection Model Based on Computer Vision

Article

Aug 2021
ARTIF INTELL

In this paper, a deep learning-enabled object detection model for multi-class plant disease is proposed, based on a state-of-the-art computer vision algorithm. While most existing models are limited to disease detection on a large scale, the current model addresses accurate, fine-grained, multi-scale detection of early-stage disease. The proposed model has been optimized for both detection speed and accuracy and applied to multi-class apple plant disease detection in a real environment. The mean average precision (mAP) and F1-score of the detection model reached 91.2% and 95.9%, respectively, at a detection rate of 56.9 FPS. The overall detection results demonstrate that the current algorithm significantly outperforms the state-of-the-art detection model, with a 9.05% increase in precision and a 7.6% increase in F1-score. The proposed model can be employed as an effective and efficient method to detect different apple plant diseases under complex orchard scenarios.
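For readers unfamiliar with the F1-score reported in this abstract, the sketch below computes precision, recall, and F1 from detection counts using the standard definitions. The counts in the example are made up for illustration and are not taken from the cited paper; the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) computed from raw detection counts."""
    predicted = true_positives + false_positives  # everything the model detected
    actual = true_positives + false_negatives     # everything that was annotated
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0  # harmonic mean
    return precision, recall, f1


# Illustrative counts only: 90 correct detections, 10 false alarms, 30 misses.
p, r, f1 = precision_recall_f1(90, 10, 30)
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is often reported alongside mAP for detection models.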

Individual Animal and Herd Identification Using Custom YOLO v3 and v4 with Images Taken from a UAV Camera at Different Altitudes

Conference Paper

Oct 2021
