
FFT-UAVNet: FFT Based Human Action Recognition for Drone Surveillance System

March 2024

DOI:10.1109/STI59863.2023.10465205

Conference: 2023 5th International Conference on Sustainable Technologies for Industry 5.0 (STI)
At: Dhaka, Bangladesh

Authors:

Ahsan Imran

University of Dhaka

Md Mehedi Hasan

University of Dhaka

2023 5th International Conference on Sustainable Technologies for Industry 5.0 (STI), 9-10 December, Dhaka

FFT-UAVNet: FFT Based Human Action Recognition for Drone Surveillance System

Abdul Monaf Chowdhury∗, Ahsan Imran∗, and Md Mehedi Hasan∗

∗Dept. of Robotics and Mechatronics Engineering, University of Dhaka, Bangladesh

Email: monafabdul15@gmail.com, ahsaanimraan@gmail.com, mmhasan@du.ac.bd

Abstract—Unmanned aerial vehicles (UAVs) have emerged as a transformative technology for human action recognition, providing a bird's-eye view and unlocking new possibilities for precise and comprehensive support in surveillance systems. While substantial advances in ground-based human action recognition have been achieved, the unique characteristics of UAV footage present new challenges that require tailored solutions. Specifically, the reduced scale of humans in aerial perspectives necessitates the development of specialised models to accurately recognize and interpret human actions. Our research focuses on modifying the well-established C3D model and incorporating Fast Fourier Transform (FFT)-based object disentanglement (FO) and space-time attention (FA) mechanisms. By leveraging the power of FFT, our model effectively disentangles the human actors from the background and captures the spatio-temporal dynamics of human actions in UAV footage, enhancing the discriminative capabilities and enabling accurate action recognition. Through extensive experimentation on a subset of the UAV-Human dataset, our proposed FFT-UAVNet (m-C3D+FO&FA+FC) model demonstrates remarkable improvements in performance. We achieve a Top-1 accuracy of 64.86% and a Top-3 accuracy of 83.37%, surpassing the results obtained by the standard C3D and X3D methods, which achieve only a Top-1 accuracy of 28.05% and 31.33%, respectively. These findings underscore the efficacy of our approach and emphasize the significance of the proposed model for UAV datasets in maximizing the potential of UAV-based human action recognition.

Index Terms—UAV, Human Action Recognition, Computer Vision, Surveillance

I. INTRODUCTION

UAV-based action recognition has the potential to revolutionise surveillance systems and improve the safety and security of human lives. Widespread deployment of this technology could raise the detection of criminal activity to previously unattainable levels. By leveraging the unique advantages of UAVs, such as their aerial perspective and mobility, together with action recognition capabilities, it is possible to develop an effective and efficient methodology for recognizing and classifying human actions in diverse scenarios, which is necessary for Industry 5.0. Imagine, for instance, a scenario like Fig. 1, where UAVs are used in public areas during major events to monitor the crowd. The system helps ensure the safety of eventgoers by identifying and flagging questionable activities (such as hitting someone, running through a crowd, or inciting a disturbance) in real time. Real-time human action recognition will allow sophisticated systems to support emergency response teams.

Fig. 1: UAV-based Human Action Recognition System

Human action recognition systems have been the focus of a significant amount of research over the years, which has led to substantial technological improvements. However, most of this work uses ground cameras, which produce static videos. Methods that generate excellent results on ground-based datasets perform poorly when applied to footage acquired by UAVs. For example, the state-of-the-art action recognition system I3D [1] scored 98.0% and 80.9% on two prominent ground-based action datasets, UCF-101 [3] and HMDB-51 [4], respectively, but achieved only 23.86% and 16.8% on two aerial action recognition datasets, UAV-Human [2] and UCF-Aerial, respectively. Deep learning techniques have been employed extensively for activity recognition [5][1][6] and have demonstrated strong performance in extracting spatio-temporal features from video data; we employ such techniques to extract meaningful features from the video captured by the UAV. Analyzing videos of scenes captured by UAV cameras [2] is nevertheless considerably more challenging than recognizing activities in datasets from ground cameras [1]. In addition to modifying traditional feature extraction methods, we incorporate techniques that take advantage of the unique characteristics of UAV-based data. For example, Fourier action recognition is utilised to encode long-range spatio-temporal correlations and automatically separate individuals from the background. By exploiting the convolution-multiplication properties of the Fourier transform, we can effectively represent and analyse the object-background entangled characteristics observed from the UAV perspective. The extracted features are then utilised to train and evaluate action recognition models. Careful attention has been given to ensuring the dependability and robustness of the proposed approach. This paper's contributions are as follows:

• A dependable system has been developed that can identify human actions in a variety of demanding environments employing UAV in a way that is stable, scalable, and flexible for surveillance purposes.

• The modified 3D convolutional Neural Network has been developed by reducing parameters and empowering it with Fast Fourier Transform modules which transform high-level features in the frequency domain and perform human object disentanglement as well as find temporal dependency in consecutive frames.

• A higher degree of accuracy has been achieved in the identification of human activity in different actions.

II. RELATED WORKS

Aerial video activity detection is a difficult task because of small target sizes, camera mobility, and scarce datasets. The use of deep learning has yielded encouraging results in this sector over the years, and numerous methods have been presented to handle the special issues posed by UAV-based recordings. Ji et al. developed a ground-breaking 3D Convolutional Neural Network (CNN) model for human action recognition [7]. This model extracts motion information from successive video frames, capturing the temporal dimension in addition to the spatial dimension. On conventional datasets, the model produced state-of-the-art results, outperforming earlier techniques using 2D CNNs or handcrafted features. However, it may struggle with occlusions and requires a significant amount of training data. Shinde et al. used the YOLO object detection algorithm for human action recognition and localization [8]. They modified YOLO to predict class probabilities and bounding boxes for various actions. The model demonstrated real-time implementation and competitive performance on the Liris Human Activities dataset. However, it may face challenges in handling complex backgrounds and recognizing subtle actions. Peng et al. proposed an automated action recognition system based on deep learning for UAV aerial imagery [9]. The components of the system are video stabilization, human action area detection, and action recognition. They introduced an updated version of InceptionResNet-v2 called Inception-ResNet-3D for action recognition, achieving high accuracy on the UCF-ARG dataset. However, the system may face challenges with fast drone movements and poor lighting conditions. Ahmad et al. combined YoloV5 with stochastic gradient boosting for detecting human actions in drone videos [10]. Their hybrid approach achieved real-time implementation and competitive performance on the Okutama-Action dataset. However, it may have limitations in detecting actions in complex backgrounds and recognizing slow or subtle actions. Ding et al. presented a lightweight action recognition framework called LARMUV, using MobileNetV3 as the feature extraction network [11]. They introduced self-attention for capturing temporal structure and used focal loss for better optimization. LARMUV achieved real-time implementation and competitive performance on standard datasets. However, it may struggle with recognizing actions of different sizes and detecting fast or subtle actions. Wang et al. introduced AZTR, a method for aerial video action recognition that utilizes temporal reasoning and auto zoom [12]. The auto zoom technique effectively isolates the human actor from the background, while temporal reasoning captures long-range space-time dependencies. AZTR achieved significant improvements in accuracy and real-time performance on various UAV datasets. However, it may have limitations in handling complex localization scenarios. Kothandaraman et al. proposed Fourier Aerial Video Recognition (FAR), a technique for detecting human actions in UAV recordings [13]. FAR employs Fourier object disentanglement to isolate the human agent from the backdrop and Fourier attention for long-range space-time reasoning. FAR achieved substantial improvements in accuracy and computational efficiency on multiple UAV datasets. However, it may have limitations in handling multiple actions in a single frame.

III. METHODOLOGY

Our solution focuses on human action recognition on unmanned aerial vehicle datasets. The proposed model is a combination of a modified C3D (m-C3D) deep learning model, Fast Fourier Transform (FFT) modules operating in the frequency domain, and fully connected layers. The proposed model's detailed architecture is shown in Fig. 2.

A. The Preprocessing Block

The preprocessing block extracts samples from video files and prepares them for further processing. Each video frame is resized to a standardized size of 112×112 pixels. The block then groups the frames into sequences, also known as chunks, to capture the temporal information in videos. Each chunk consists of 16 consecutive frames.
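The chunking step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the nearest-neighbour resizing, the non-overlapping chunk layout, and the decision to drop an incomplete final chunk are all assumptions, since the paper only specifies the 112×112 frame size and the 16-frame chunk length.

```python
import numpy as np

def resize_nearest(frame, size=112):
    """Nearest-neighbour resize of an (H, W, C) frame to (size, size, C)."""
    height, width = frame.shape[:2]
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    return frame[rows[:, None], cols[None, :]]

def make_chunks(frames, chunk_len=16, size=112):
    """Group resized frames into non-overlapping chunks of chunk_len frames."""
    resized = np.stack([resize_nearest(frame, size) for frame in frames])
    n_chunks = len(resized) // chunk_len
    # Assumption: frames that do not fill a whole chunk are discarded.
    resized = resized[: n_chunks * chunk_len]
    return resized.reshape(n_chunks, chunk_len, size, size, resized.shape[-1])

# Hypothetical 40-frame clip at 240x320 resolution with 3 colour channels.
video = np.zeros((40, 240, 320, 3))
chunks = make_chunks(video)
print(chunks.shape)  # (2, 16, 112, 112, 3)
```

A real pipeline would more likely use an interpolating resize from an image library; nearest-neighbour indexing is used here only to keep the sketch dependency-free.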

B. Modified Convolutional 3D (m-C3D) Model

The C3D model architecture is a state-of-the-art approach for video analysis and classification. It uses 3D convolutions to capture both temporal and spatial information. The model consists of convolutional layers followed by max pooling operations, organized into several layer groups. After the convolutional layers, two fully connected layers are used with dropout regularization to prevent overfitting. The final output layer generates class probabilities for video clip classification.

[Figure: block diagram of the pipeline. Video → Pre-processing (resized 16-frame chunks) → Feature Extraction (m-C3D) → Fourier Object Disentanglement → Fourier Space-Time Attention → Fully Connected Layer → Classification → Label]

Fig. 2: Proposed architecture incorporating a Preprocessing Block, m-C3D Block (sky-blue), Fourier Object Disentanglement Block (green), Fourier Space-Time Attention Block (red), and a Fully Connected Layer followed by a 10-unit Dense Layer with Softmax Activation Function

In this work, the m-C3D model utilizes the “fc6” layer to capture more general features, preventing overfitting and promoting transferability and generalization across tasks and datasets. This modified version of the conventional 3D CNN model performs well on the UAV dataset, as validated in the experiment section. By leveraging the learned representations from a pre-trained C3D model [14], the model's performance is improved in specific classification tasks.

C. Fourier Disentangled Space Time Attention

This section focuses on the module utilized for decoding human actors' actions and encoding contextual information. Fourier Object Disentanglement (FO) automatically separates the object from the background, while Fourier Space-Time Attention (FA) incorporates self-attention properties to capture extensive spatial and temporal relationships at a reduced computational burden.

D. Fourier Object Disentanglement

In this research, we use the Fourier Object Disentanglement (FO) approach, which addresses the task of automatically isolating the human agent from the surroundings in surveillance scenarios. The movement of humans within a scene can be captured by examining the temporal variations in the feature maps that encode the spatial information of the video frames across the dimensions of the scene (H×W).

In order to detect and characterize movement, we begin by transforming the feature maps into temporal frequency space. This transformation allows us to examine the signal's behaviour across different temporal frequencies and extract valuable information regarding the presence of movement in the scene.

In our approach, we utilize a 1D Fourier transform along the temporal dimension to perform the necessary computations. Let the feature maps on which the Fourier Object Disentanglement (FO) method is applied be represented by $f(c, t, h, w) \in C \times T' \times H' \times W'$. Here, $H' \times W'$ and $T'$ denote the spatial and temporal dimensions of the feature maps, while $C$ denotes the number of channels. The temporal Fourier transform's amplitude at a particular frequency, denoted by $-2\pi k/N$, is calculated as follows:

$$F_T(f)(k) = \sum_{n=0}^{T'} f(c, t, h, w) \times e^{-2\pi k n / N}. \tag{1}$$

To efficiently compute this transform, we employ the Fast Fourier Transform (FFT) algorithm [15], which provides an optimized solution for this task.

For each spatial and channel location in the feature map $f$, the amplitude of the temporal signal is captured mathematically as $F_T(f)(k)$. In simple terms, higher frequencies in the temporal dimension correspond to movement, while lower frequencies indicate static regions in the scene. Consequently, areas associated with the human actor's motion should exhibit higher amplitudes in the Fourier transform at higher frequencies.

It is important to note that the frequencies utilized in the Fourier Object Disentanglement (FO) technique are independent of the input video. Consequently, we can express the dynamic mask $M_{FO}$ as:

$$M_{FO} = \lVert F_T(f)(k) \rVert_2^2 \times \lVert fr_k \rVert_2^2 \tag{2}$$

where $\lVert a \rVert_2^2$ denotes the squared L2-norm of a vector $a$. The dynamic mask $M_{FO}$ serves to disentangle or amplify the regions in the scene that correspond to moving pixels. It is important to note that these regions may include both the moving background (including camera motion) and the moving human actor. Our subsequent task involves using $M_{FO}$ to distinguish the moving object pixels from the moving background pixels.

We use the model's activation maps $f$ to isolate the moving actor. Although not flawless, these activations tend to be higher in salient regions of the scene than in non-salient regions. As a result, the final representation of the disentangled object, denoted $F_{FO}$, is obtained by element-wise multiplication (Hadamard product) of the network features $f$ and the dynamic mask $M_{FO}$, thereby amplifying the dynamic and prominent areas throughout the frame:

$$F_{FO} = f \odot M_{FO}. \tag{3}$$
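The three FO steps (temporal FFT, dynamic mask, Hadamard product) can be sketched in NumPy as below. This is a hedged illustration rather than the authors' implementation: the function name is hypothetical, `fr_k` is assumed to be the normalised frequency of bin k as returned by `np.fft.fftfreq`, and summing the weighted amplitudes over all frequency bins to obtain one mask value per channel and spatial location is an assumption, because Equation 2 does not state how the bins are aggregated.

```python
import numpy as np

def fourier_object_disentanglement(f):
    """Return (F_FO, M_FO) for real feature maps f of shape (C, T, H, W)."""
    n_frames = f.shape[1]
    # Equation 1: FFT along the temporal axis.
    spectrum = np.fft.fft(f, axis=1)
    amplitude_sq = np.abs(spectrum) ** 2
    # Assumption: fr_k is the normalised frequency of bin k. It depends
    # only on the number of frames, not on the video content.
    freq_sq = np.fft.fftfreq(n_frames) ** 2
    # Equation 2, with an assumed sum over frequency bins: static
    # locations (energy only at k = 0) receive a mask value of zero.
    mask = np.einsum("k,ckhw->chw", freq_sq, amplitude_sq)
    # Equation 3: Hadamard product, broadcasting the mask over time.
    disentangled = f * mask[:, None, :, :]
    return disentangled, mask
```

With this weighting, a location whose activation never changes over time is suppressed entirely, while a location with strong high-frequency content is amplified, which matches the qualitative behaviour the text describes.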

E. Space-Time Fourier Attention

In some scenarios, foreground and background are interconnected, and consecutive frames are temporally dependent in the action recognition setting. Although explicitly modelling the relationships between individual pixels that represent orientations, joint motions, and positions may be unnecessary, it remains essential for the neural network to recognize and learn these aspects autonomously. Space-time self-attention has proven effective in extracting such knowledge for video action recognition, and several studies such as [6][16] have explored space-time self-attention mechanisms to capture temporal and spatial dependencies in video data. However, these approaches often involve computationally expensive matrix multiplications, which can limit their practical applicability. By leveraging the Fourier transform, FA approximates self-attention in a computationally efficient manner [17].

The self-attention mechanism takes key, query, and value vectors as input, which are derived from a shared input feature map through 1×1 convolution layers. According to Vaswani et al. [18], self-attention is computed as a weighted sum of the values, where the weight allocated to each value is given by a compatibility function that evaluates how closely the query matches the corresponding key. This compatibility function therefore determines the relevance of each value for the given query. Mathematically, let $x$ denote the input feature maps, and let $\otimes$ represent matrix multiplication. The attention computation can be expressed as follows:

$$\text{Attention} = \text{Value}(x) \otimes [\text{Query}(x)^T \otimes \text{Key}(x)]^T \tag{4}$$

The space-time Fourier attention method operates in the following manner. Initially, a representation analogous to the key-query computation is obtained, referred to as the Fourier sub-attention. The concept of Fourier sub-attention draws inspiration from autocorrelation, which quantifies the correlation coefficient between distinct segments of a given signal.

Sub-attention in the Fourier domain involves taking the element-wise product of the Fourier transform of the feature maps with its complex conjugate. To obtain the space-time Fourier sub-attention, the video feature maps ($f$) are reshaped into a 3D representation ($C \times T' \times (HW)$) and translated to the frequency domain through a 2D Fourier transform along the spatial and temporal axes. The transformation is expressed using the equation:

$$F_{ST}(f)(m, n) = \sum_{h,w} f(c, t, h, w)\, e^{-2\pi m h / M}\, e^{-2\pi n w / N}, \tag{5}$$

where $m$ and $n$ represent the frequency indices, and $M$ and $N$ are the dimensions of the spatial and temporal axes respectively. The Fast Fourier Transform (FFT) algorithm [15] is employed for efficient computation of the Fourier transform. FFT allows for extensive global interactions between distinct temporal and spatial regions in the video by expressing the entire signal across a wide range of frequencies. Multiplying $F_{ST}$ by its complex conjugate $F_{ST}^{*}$ yields the space-time Fourier sub-attention $A_{ST}$ in the Fourier domain, as shown in Equation 6:

$$A_{ST} = F_{ST} \times F_{ST}^{*} \tag{6}$$

To obtain the correlations in the time domain, we compute the inverse Fast Fourier Transform ($IF$) of the space-time Fourier sub-attention $A_{ST}$. The resulting correlation maps are then reshaped to match the dimensions of the input feature maps ($C \times T' \times H' \times W'$).

The input feature maps are combined with the sub-attention weights using a dot product method. The final space-time Fourier attention maps $f_{FA}$ are computed using $f$. A scaling factor $\lambda_{FA}$, empirically chosen to be 0.01, scales the Fourier attention maps. The following equation describes the combination of the input feature maps and the scaled attention maps:

$$f_{FA} = f + \lambda_{FA} \times IF(A_{ST}) \tag{7}$$
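Equations 5 to 7 can be sketched in a few lines of NumPy. This is an illustrative reading under stated assumptions, not the authors' code: the function name is hypothetical, the feature maps are assumed to be real-valued, and the 2D FFT is taken over the time axis and the flattened spatial axis of the C × T′ × (HW) representation.

```python
import numpy as np

def fourier_space_time_attention(f, lam=0.01):
    """Return f_FA for real feature maps f of shape (C, T, H, W)."""
    channels, frames, height, width = f.shape
    # Flatten space so that each channel is a 2D (time, space) signal.
    flat = f.reshape(channels, frames, height * width)
    # Equation 5: 2D FFT over the temporal and spatial axes.
    spectrum = np.fft.fft2(flat, axes=(1, 2))
    # Equation 6: product of the spectrum with its complex conjugate.
    sub_attention = spectrum * np.conj(spectrum)
    # Inverse FFT of the power spectrum gives the circular
    # autocorrelation, which is real for real-valued input.
    correlation = np.fft.ifft2(sub_attention, axes=(1, 2)).real
    # Equation 7: residual combination with scaling factor lambda_FA.
    return f + lam * correlation.reshape(f.shape)
```

Because the inverse transform of a power spectrum is an autocorrelation, each output position aggregates information from every space-time position of the channel at the cost of two FFTs, instead of the pairwise score matrix used by standard self-attention.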

F. Creating the End-to-End Model

In the proposed approach, a C3D model pre-trained on a large-scale sports video dataset [19] is utilized as a feature extractor. To ensure that the pre-trained weights remain unchanged while training on the new task, the layers of the pre-trained model are set to be non-trainable. The output of the pre-trained model then undergoes a series of operations, including disentanglement and spatial causality functions. These operations enhance the feature representation by separating temporal and spatial information and emphasizing spatially coherent patterns. To further improve generalization and prevent overfitting, dropout layers are added to the model. Subsequently, fully connected layers are incorporated to map the enhanced features to the output classes of the specific task. Finally, the end-to-end model is compiled with an appropriate loss function, such as sparse categorical cross-entropy, and an optimizer, commonly Adam. Additional evaluation metrics, such as accuracy, can also be specified to assess the model's performance [19].

IV. EXPERIMENT

This section presents a comprehensive analysis and evaluation of the human action recognition system using unmanned aerial vehicles (UAVs). The proposed methodology was applied to the UAV-Human dataset [2]. With 119 people and 67,428 multi-modal video sequences for action identification, UAV-Human offers a benchmark for understanding human behaviour. The dataset contains 155 action classes, making it suitable for recognising a wide variety of human actions.

For our surveillance-specific task, we narrowed down the dataset to a subset of classes that were particularly relevant to our research objective. Specifically, we selected ten action classes associated with violent actions. These classes were chosen based on their similarities and relevance to violence, as it is an important aspect to consider in UAV-based surveillance scenarios. These action classes were punching someone, kicking someone, pushing someone, slapping someone on the back, holding someone hostage, threatening someone with a knife, threatening someone with a gun, dragging someone, calling for help, and stabbing someone with a knife.

The resulting dataset consisted of a total of 10 action classes, with each class containing 30 videos. To divide the dataset for training and testing purposes, a split ratio of 80:20 was employed.
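As a worked example of the split, the sketch below divides 10 classes of 30 videos each at a ratio of 80:20. The per-class (stratified) split, the fixed seed, and the function name are assumptions made for illustration; the paper states only the ratio.

```python
import random

def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split sample indices per class into train and test index lists."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test

# 10 classes with 30 videos each, as in the subset described above.
labels = [c for c in range(10) for _ in range(30)]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 240 60
```

Under these assumptions each class contributes 24 training videos and 6 test videos.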

During the training phase, the data subset was preprocessed and features were extracted from it. For the sake of computational performance, the features were stored using memory mapping. After several training iterations, the proposed model was evaluated on the data subset, yielding the following results:

• Loss = 0.992

• Top-1 Accuracy = 64.86%

• Top-3 Accuracy = 83.37%
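For reference, Top-k accuracy counts a prediction as correct when the true class is among the k classes with the highest predicted probability. The following is a minimal NumPy sketch; the function name and the toy probabilities are illustrative and are not taken from the paper's experiments.

```python
import numpy as np

def top_k_accuracy(probs, labels, k):
    """Fraction of samples whose true label is among the k top-scoring classes."""
    probs = np.asarray(probs)
    labels = np.asarray(labels)
    # argsort is ascending, so the last k columns hold the k largest scores.
    top_k = np.argsort(probs, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example with three samples and four classes.
toy_probs = [[0.10, 0.20, 0.30, 0.40],
             [0.40, 0.30, 0.20, 0.10],
             [0.25, 0.35, 0.30, 0.10]]
toy_labels = [3, 1, 3]
top1 = top_k_accuracy(toy_probs, toy_labels, k=1)  # only the first sample is a hit
top3 = top_k_accuracy(toy_probs, toy_labels, k=3)  # the first two samples are hits
```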

The left portion of Fig. 3 shows the loss function during the first five epochs of training. The loss decreases sharply with each successive epoch, which suggests that the model's performance improves as training proceeds.

As the loss decreased during training, there was a noticeable increase in the overall accuracy of the system, as shown on the right side of Fig. 3. The improvement in accuracy suggests that the model's performance was enhanced as it learned from the training data and iteratively updated its parameters to minimise the loss. Furthermore, the validation set, used to evaluate the model's generalisation, exhibited even better results in detecting human actions than the training set.

Fig. 3: Loss function (left) and accuracy curve (right) of the proposed model during training

Fig. 4: ROC curve of the proposed model

The Receiver Operating Characteristic (ROC) curve in Fig. 4 involves 10 distinct classes. For a range of threshold values, the ROC curve plots the true positive rate (sensitivity) against the false positive rate (1-specificity). By plotting these values, we gained insights into how well the model distinguished the positive class from the negative classes, independently for each of the ten classes. As this is a multi-class action recognition problem, we used a One-vs-All (OvA) approach. To assess each class's area under the ROC curve, one class is set positive and the others are set negative. The per-class ROC curves are then combined into a single graph. Our results revealed that the proposed model achieved excellent discrimination performance for most classes, as evidenced by the high area under the ROC curve (AUC) values consistently above 0.8.
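The One-vs-All AUC computation can be sketched without plotting. The version below uses the rank-based identity that the AUC equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half; the function name and toy scores are illustrative assumptions rather than the paper's evaluation code.

```python
import numpy as np

def one_vs_all_auc(probs, labels):
    """Per-class AUC: each class in turn is positive, all others negative."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    aucs = {}
    for cls in range(probs.shape[1]):
        positive = probs[labels == cls, cls]
        negative = probs[labels != cls, cls]
        if len(positive) == 0 or len(negative) == 0:
            continue  # AUC is undefined without both groups
        diff = positive[:, None] - negative[None, :]
        # Fraction of (positive, negative) pairs ranked correctly.
        aucs[cls] = float(((diff > 0) + 0.5 * (diff == 0)).mean())
    return aucs

# Toy two-class example.
toy_scores = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
toy_classes = [0, 1, 1, 0]
print(one_vs_all_auc(toy_scores, toy_classes))  # {0: 0.75, 1: 0.75}
```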

To gain deeper insight into the model's classification accuracy, we utilised a confusion matrix analysis. For our ten-class problem, the confusion matrix in Fig. 5 is a 10×10 matrix in which every column represents the predicted class and every row represents the actual class. The main diagonal holds the true positive predictions for each class, that is, the number of instances correctly classified for that class. For example, the value 80 in the diagonal cell for class 8 indicates that the model correctly classified 80 instances of class 8 (calling for help; the classes are numbered from 0 to 9 for computational simplicity). Off-diagonal elements signify misclassifications. For example, the value 8 in the top right corner of the confusion matrix indicates that the model incorrectly classified 8 instances of class 0 (punching someone) as class 9 (stabbing someone with a knife). The confusion matrix in Fig. 5 illustrates that the model performed well overall, with a high percentage of true positives for all classes.

Fig. 5: Confusion matrix of the trained model
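The row and column convention described above (rows for actual classes, columns for predicted classes) can be illustrated with a short NumPy sketch. The function name and the three-class toy labels are hypothetical and unrelated to the values in Fig. 5.

```python
import numpy as np

def confusion_matrix(actual, predicted, n_classes):
    """Count matrix with actual classes as rows and predicted classes as columns."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    # add.at accumulates correctly even when an index pair repeats.
    np.add.at(matrix, (actual, predicted), 1)
    return matrix

# Toy three-class example.
toy_actual = [0, 0, 1, 2, 2, 2]
toy_predicted = [0, 2, 1, 2, 2, 0]
toy_matrix = confusion_matrix(toy_actual, toy_predicted, n_classes=3)
# Diagonal entries are correct predictions; dividing them by the row
# sums gives the per-class true positive rate.
per_class_rate = toy_matrix.diagonal() / toy_matrix.sum(axis=1)
```

In the toy example, the entry in row 0, column 2 counts one instance of class 0 that was misclassified as class 2, mirroring the top-right-corner reading given in the text.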

The proposed model's effectiveness was further assessed by comparing it with two state-of-the-art models, namely C3D [20] and X3D [5], without making any modifications to the network architectures, as can be seen in Table I. Two fully connected layers at the end of both networks were used to recognize actions. The C3D and X3D models were trained and assessed on the same dataset, for the same set of classes and the same number of epochs. The outcomes of the C3D and X3D models were notably inferior to those of the proposed model. In terms of top-1 accuracy, the proposed model outperformed X3D and C3D by 33.53 and 36.81 percentage points, respectively. FFT-UAVNet also outperformed X3D and C3D in terms of top-3 accuracy, by 37.77 and 24.72 percentage points, respectively.

TABLE I: Evaluation on C3D, X3D, and proposed model

Method                          Loss     Top-1 Accuracy   Top-3 Accuracy

C3D [20]                        2.145    28.05%           58.65%

X3D [5]                         1.977    31.33%           45.60%

Proposed Method (FFT-UAVNet)    0.992    64.86%           83.37%
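The margins between the proposed model and each baseline follow directly from Table I. The short check below recomputes them in percentage points; the dictionary layout and function name are illustrative, while the accuracy values are copied from the table.

```python
# (Top-1 %, Top-3 %) as reported in Table I.
table_one = {
    "C3D": (28.05, 58.65),
    "X3D": (31.33, 45.60),
    "FFT-UAVNet": (64.86, 83.37),
}

def gains_over_baselines(results, proposed="FFT-UAVNet"):
    """Percentage-point gains of the proposed model over each baseline."""
    top1, top3 = results[proposed]
    return {
        name: (round(top1 - base1, 2), round(top3 - base3, 2))
        for name, (base1, base3) in results.items()
        if name != proposed
    }

gains = gains_over_baselines(table_one)
# C3D: 36.81 (Top-1) and 24.72 (Top-3) points
# X3D: 33.53 (Top-1) and 37.77 (Top-3) points
```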

This observation indicated that the proposed method demonstrated an improvement in action recognition and detection compared to the C3D and X3D models. It suggested that the proposed model was more effective in accurately evaluating and detecting various human actions within the given dataset.

V. CONCLUSION

We have developed a UAV-based system for human action recognition in video data, aiming to improve precision and speed in practical applications. Through extensive research, we addressed limitations of traditional approaches and achieved significant improvements using UAVs. Our framework combined modified C3D algorithms with Fourier Action Recognition to capture spatial and temporal information in aerial videos. The deep learning architectures enabled precise detection and classification of human actions. Experimental results validated the effectiveness and robustness of our method, showing improved accuracy and efficiency compared to conventional C3D and X3D approaches. The scalability and adaptability of our framework make it suitable for various applications, especially surveillance.

REFERENCES

[1] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new

model and the kinetics dataset,” in proceedings of the IEEE Conference

on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.

[2] T. Li, J. Liu, W. Zhang, Y. Ni, W. Wang, and Z. Li, “Uav-human: A large

benchmark for human behavior understanding with unmanned aerial

vehicles,” in Proceedings of the IEEE/CVF conference on computer

vision and pattern recognition, 2021, pp. 16 266–16 275.

[3] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human

actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402,

2012.

[4] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “Hmdb: a

large video database for human motion recognition,” in 2011 Interna-

tional conference on computer vision. IEEE, 2011, pp. 2556–2563.

[5] C. Feichtenhofer, “X3d: Expanding architectures for efﬁcient video

recognition,” in Proceedings of the IEEE/CVF conference on computer

vision and pattern recognition, 2020, pp. 203–213.

[6] G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all

you need for video understanding?” in ICML, vol. 2, no. 3, 2021, p. 4.

[7] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks

for human action recognition,” IEEE transactions on pattern analysis

and machine intelligence, vol. 35, no. 1, pp. 221–231, 2012.

[8] S. Shinde, A. Kothari, and V. Gupta, “Yolo based human action

recognition and localization,” Procedia computer science, vol. 133, pp.

831–838, 2018.

[9] H. Peng and A. Razi, “Fully autonomous uav-based action recognition

system using aerial imagery,” in International symposium on visual

computing. Springer, 2020, pp. 276–290.

[10] T. Ahmad, M. Cavazza, Y. Matsuo, and H. Prendinger, “Detecting human

actions in drone images using yolov5 and stochastic gradient boosting,”

Sensors, vol. 22, no. 18, p. 7020, 2022.

[11] M. Ding, N. Li, Z. Song, R. Zhang, X. Zhang, and H. Zhou,

“A lightweight action recognition method for unmanned-aerial-vehicle

video,” in 2020 IEEE 3rd International Conference on Electronics and

Communication Engineering (ICECE). IEEE, 2020, pp. 181–185.

[12] X. Wang, R. Xian, T. Guan, C. M. de Melo, S. M. Nogar, A. Bera, and D. Manocha, “AZTR: Aerial video action recognition with auto zoom and temporal reasoning,” arXiv preprint arXiv:2303.01589, 2023.

[13] D. Kothandaraman, T. Guan, X. Wang, S. Hu, M. Lin, and D. Manocha, “FAR: Fourier aerial video recognition,” in European Conference on Computer Vision. Springer, 2022, pp. 657–676.

[14] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.

[15] M. Frigo and S. G. Johnson, “FFTW: An adaptive software architecture for the FFT,” in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181), vol. 3. IEEE, 1998, pp. 1381–1384.

[16] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu, “Video Swin transformer,” arXiv preprint arXiv:2106.13230, 2021.

[17] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” in International Conference on Machine Learning. PMLR, 2019, pp. 7354–7363.

[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.

[19] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.

[20] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.
