2023 5th International Conference on Sustainable Technologies for
Industry 5.0 (STI), 9-10 December, Dhaka
FFT-UAVNet: FFT Based Human Action
Recognition for Drone Surveillance System
Abdul Monaf Chowdhury, Ahsan Imran, and Md Mehedi Hasan
Dept. of Robotics and Mechatronics Engineering, University of Dhaka, Bangladesh
Email: monafabdul15@gmail.com, ahsaanimraan@gmail.com, mmhasan@du.ac.bd
Abstract—Unmanned aerial vehicles (UAVs) have emerged
as a transformative technology for human action recognition,
providing a bird's-eye view and unlocking new possibilities for
precise and comprehensive support in surveillance systems. While
substantial advances in ground-based human action recogni-
tion have been achieved, the unique characteristics of UAV
footage present new challenges that require tailored solutions.
Specifically, the reduced scale of humans in aerial perspectives
necessitates the development of specialised models to accurately
recognize and interpret human actions. Our research focuses on
modifying the well-established C3D model and incorporating Fast
Fourier Transform (FFT)-based object disentanglement (FO) and
space-time attention (FA) mechanisms. By leveraging the power
of FFT, our model effectively disentangles the human actors
from the background and captures the spatio-temporal dynamics
of human actions in UAV footage, enhancing the discriminative
capabilities and enabling accurate action recognition. Through
extensive experimentation on a subset of the UAV-Human dataset,
our proposed FFT-UAVNet (m-C3D+FO&FA+FC) model demon-
strates remarkable improvements in performance. We achieve a
Top-1 accuracy of 64.86% and a Top-3 accuracy of 83.37%,
surpassing the results obtained by the standard C3D and X3D
methods, which achieve only a Top-1 accuracy of 28.05% and
31.33%, respectively. These findings underscore the efficacy of
our approach and emphasize the significance of the proposed
model for UAV datasets in maximizing the potential of UAV-
based human action recognition.
Index Terms—UAV, Human Action Recognition, Computer
Vision, Surveillance
I. INTRODUCTION
UAV-based action recognition has the potential to revolu-
tionise the surveillance system and improve the safety and
security of human lives. Widespread deployment of this technology would allow criminal activity to be detected at a scale not previously possible. By leveraging the
unique advantages of UAVs, such as their aerial perspective,
mobility, and action recognition capabilities, it is possible to
develop an effective and efficient methodology for recognizing
and classifying human actions in diverse scenarios which is
necessary for Industry 5.0. Imagine, for instance, a scenario
like Fig 1, where during major events, UAVs are used in public
areas to keep an eye on the people. The system ensures the
safety of eventgoers by identifying and flagging questionable
activities (such as hitting someone, running through a crowd,
or inciting a disturbance) in real-time. Real-time human ac-
tion recognition will allow sophisticated systems to support
emergency response teams.
Fig. 1: UAV-based Human Action Recognition System
Human action recognition systems have been the focus of a
significant amount of research throughout the years, which has
resulted in significant technological improvements. However,
the majority of the work has been accomplished with the use
of ground cameras, which produce static videos. Although these methods generate excellent results on ground-based datasets, their performance degrades sharply when applied to footage acquired by UAVs. For example, the state-of-the-art action recognition system I3D [1] achieved only 23.86% and 16.8% accuracy on two aerial action recognition datasets, UAV-Human [2] and UCF-Aerial, respectively, while scoring 98.0% and 80.9% on
two prominent ground-based action datasets, UCF-101 [3] and
HMDB-51 [4], respectively. To extract meaningful features
from the video data captured by the UAV, advanced computer
vision techniques are employed that have demonstrated strong
performance in extracting spatio-temporal features from video
data. Activity recognition has extensively employed deep
learning techniques [5][1][6]. Analyzing videos of scenes cap-
tured by UAV cameras [2] is considerably more challenging
compared to recognizing activities in datasets from ground
cameras [1]. In addition to modifying traditional feature ex-
traction methods, we incorporate innovative techniques that
take advantage of the unique characteristics of UAV-based
data. For example, Fourier action recognition is utilised to
encode long-range spatio-temporal correlations and automati-
cally separate individuals from the background. By exploit-
ing the convolution-multiplication properties of the Fourier
transform, we can effectively represent and analyse the object-
background entangled characteristics observed from the UAV
perspective. The extracted features are then utilised to train
and evaluate action recognition models. Careful attention has
been given to ensure the proposed approach’s dependability
and robustness. This paper's contributions are as follows:
• A dependable system has been developed that can identify human actions in a variety of demanding environments employing UAVs, in a way that is stable, scalable, and flexible for surveillance purposes.
• A modified 3D Convolutional Neural Network has been developed by reducing parameters and empowering it with Fast Fourier Transform modules, which transform high-level features into the frequency domain, perform human-object disentanglement, and capture temporal dependencies across consecutive frames.
• A higher degree of accuracy has been achieved in the identification of different human actions.
II. RELATED WORKS
Aerial video activity detection is a difficult task because
of small target sizes, camera mobility, and scarce datasets.
The use of deep learning has yielded encouraging results
in this sector over the years, and numerous methods have
been presented to handle the special issues posed by UAV-
based recordings. Ji et al. developed a ground-breaking 3D Convolutional Neural Network (CNN) model for human action recognition [7]. This model extracts motion information from
successive video frames, capturing the temporal dimension in
addition to the spatial dimension. On conventional datasets,
the model produced state-of-the-art results, outperforming
earlier techniques using 2D CNNs or handmade features.
However, it may struggle with occlusions and necessitates
a significant amount of data for training. Shinde et al.
used the YOLO object detection algorithm for human action
recognition and localization [8]. They modified YOLO to
predict class probabilities and bounding boxes for various
actions. The model demonstrated real-time implementation
and competitive performance on the Liris Human Activities
dataset. However, it may face challenges in handling com-
plex backgrounds and recognizing subtle actions. Peng et al.
proposed an automated action recognition system based on
deep learning for UAV aerial imagery [9]. The components
of the system are action recognition, video stabilization, and
human action area detection. They introduced an updated
version of InceptionResNet-v2 called Inception-ResNet-3D for
action recognition, achieving high accuracy on the UCF-ARG
dataset. However, the system may face challenges with fast
drone movements and poor lighting conditions. Ahmad et
al. combined YoloV5 with stochastic gradient boosting for
detecting human actions in drone videos [10]. Their hybrid
approach achieved real-time implementation and competitive
performance on the Okutama-Action dataset. However, it may
have limitations in detecting actions in complex backgrounds
and recognizing slow or subtle actions. Ding et al. presented
a lightweight action recognition framework called LARMUV,
using MobileNetV3 as the feature extraction network [11].
They introduced self-attention for capturing temporal struc-
ture and used focal loss for better optimization. LARMUV
achieved real-time implementation and competitive perfor-
mance on standard datasets. However, it may struggle with
recognizing actions of different sizes and detecting fast or
subtle actions. Wang et al. introduced AZTR, a method for
aerial video action recognition that utilizes temporal reasoning
and auto zoom [12]. The auto zoom technique effectively
isolates the human actor from the background, while tem-
poral reasoning captures long-range space-time dependencies.
AZTR achieved significant improvements in accuracy and real-
time performance on various UAV datasets. However, it may
have limitations in handling complex localization scenarios.
Kothandaraman et al. proposed Fourier Activity Recognition (FAR),
a technique for detecting human actions in UAV recordings
[13]. FAR employs Fourier object disentanglement to isolate
the human agent from the backdrop and Fourier attention for
long-range space-time reasoning. FAR achieved substantial
improvements in accuracy and computational efficiency on
multiple UAV datasets. However, it may have limitations in
handling multiple actions in a single frame.
III. METHODOLOGY
Our solution focuses on human-action recognition on un-
manned aerial vehicle datasets. The proposed model is a
combination of a modified C3D (m-C3D) deep learning model,
Fast Fourier Transform (FFT) in the frequency domain and
Fully Connected layers. The proposed model’s detailed archi-
tecture is shown in Figure 2.
A. The Preprocessing Block
The preprocessing block is used to extract samples from
video files and prepare them for further processing. The block
extracts features from video frames by resizing them to a
standardized size (112x112). The block generates sequences
of frames, also known as chunks, to capture the temporal
information in videos. Each chunk consists of 16 consecutive
frames.
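A minimal sketch of this preprocessing step is given below. OpenCV is assumed for frame decoding and resizing (the paper does not name a specific library), and the non-overlapping chunking strategy is our interpretation of the description above.

```python
import cv2
import numpy as np

def extract_chunks(video_path, frame_size=(112, 112), chunk_len=16):
    """Decode a video, resize each frame to 112x112, and group frames
    into consecutive 16-frame chunks (illustrative sketch)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, frame_size))
    cap.release()

    # Drop the tail that does not fill a complete 16-frame chunk.
    chunks = [
        np.stack(frames[i:i + chunk_len])          # shape: (16, 112, 112, 3)
        for i in range(0, len(frames) - chunk_len + 1, chunk_len)
    ]
    return np.array(chunks, dtype=np.float32)
```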
B. Modified Convolutional 3D (m-C3D) Model
The C3D model architecture is a state-of-the-art approach
for video analysis and classification. It incorporates 3D con-
volutions to obtain both temporal and spatial information. The
model consists of convolutional layers followed by max pool-
ing operations organized into several layer groups. After the
convolutional layers, two fully connected layers with dropout regularization are used to prevent overfitting. The final output layer generates class probabilities for video clip classification.

Fig. 2: Proposed architecture incorporating a Preprocessing Block, m-C3D Block (sky blue), Fourier Object Disentanglement Block (green), Fourier Space-Time Attention Block (red), and a Fully Connected Layer followed by a 10-unit Dense Layer with Softmax Activation Function
In this work, the m-C3D model utilizes the “fc6” layer
to capture more general features, preventing overfitting and
promoting transferability and generalization across tasks and
datasets. This modified version of the conventional 3D CNN model proved to perform well on the UAV dataset, as validated in the experiment section. By leveraging the
learned representations from a pre-trained C3D model [14],
the model’s performance is improved in specific classification
tasks.
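A hedged Keras-style sketch of this feature extractor is given below; load_pretrained_c3d and the layer name "fc6" follow the paper's description, but the exact loading code is an assumption.

```python
from tensorflow import keras

def build_mc3d_feature_extractor():
    """Truncate a pre-trained C3D network at the 'fc6' layer and freeze it,
    mirroring the m-C3D feature extractor described above (sketch)."""
    c3d = load_pretrained_c3d()          # hypothetical loader for C3D weights [14]
    fc6 = c3d.get_layer("fc6").output    # general-purpose features from fc6
    extractor = keras.Model(inputs=c3d.input, outputs=fc6, name="m_C3D")
    extractor.trainable = False          # keep the pre-trained weights fixed
    return extractor
```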
C. Fourier Disentangled Space Time Attention
This section focuses on the module utilized for decoding
human actors’ actions and encoding contextual information.
Fourier Object Disentanglement (FO) automatically separates
the object from the background, while Fourier Space-Time
Attention (FA) incorporates self-attention properties to capture
extensive spatial and temporal relationships at a reduced
computational burden.
D. Fourier Object Disentanglement
In this research, we have used the Fourier Object Disentan-
glement (FO) approach, which effectively addresses the task
of automatically isolating the human agent from the surround-
ings in surveillance scenarios. The movement of the humans
within a scene can be effectively captured by examining the
temporal variations in the feature maps that encode the spatial
information of the video frames across the dimensions of the
scene (H×W).
In order to detect and characterize movement, we begin
by transforming the feature maps into a temporal frequency
space. This transformation allows us to examine the signal’s
behaviour across different temporal frequencies and extract
valuable information regarding the presence of movement in
the scene.
In our approach, we utilize a 1D Fourier transform along the
temporal dimension to perform the necessary computations.
Let the feature maps be represented by $f(c, t, h, w) \in \mathbb{R}^{C \times T' \times H' \times W'}$, on which the Fourier Object Disentanglement (FO) method is applied. Here, $(H' \times W')$ and $T'$ denote the spatial and temporal dimensions of the feature maps, while $C$ denotes the number of channels. The temporal Fourier transform's amplitude at a particular frequency, denoted by $2\pi k/N$, is calculated as follows:

$$\mathcal{F}_T(f)(k) = \sum_{n=0}^{T'} f(c, t, h, w) \times e^{-2\pi i k n / N}. \quad (1)$$
To efficiently compute this transform, we employ the Fast
Fourier Transform (FFT) algorithm [15], which provides an
optimized solution for this task.
For each spatial and channel location in the feature map
$f$, the amplitude of the temporal signal is captured mathematically as $\mathcal{F}_T(f)(k)$. In simple terms, higher frequencies in
the temporal dimension correspond to movement, while lower
frequencies indicate static regions in the scene. Consequently,
areas associated with the human actor’s motion should exhibit
higher amplitudes in the Fourier transform at higher frequen-
cies.
It is important to note that the frequencies utilized in the
Fourier Object Disentanglement (FO) technique are indepen-
dent of the input video. Consequently, we can express the dynamic mask $M_{FO}$ as:

$$M_{FO} = \|\mathcal{F}_T(f)(k)\|_2^2 \times \|fr_k\|_2^2, \quad (2)$$

where $\|a\|_2^2$ denotes the squared L2-norm of a vector $a$. The dynamic mask $M_{FO}$ serves to disentangle or amplify
the regions in the scene that correspond to moving pixels.
It is important to note that these regions may include both
the moving background (including camera motion) and the
moving human actor. Our subsequent task involves using
$M_{FO}$ to distinguish the shifting object pixels from the moving
background pixels.
We use the model's activation maps $f$ to isolate the moving
actor. Although not flawless, these activations tend to be higher
in salient regions of the scene compared to non-salient regions.
As a result, the final representation of the disentangled object
can be obtained by taking the element-wise product of the network features $f$ and $M_{FO}$, thereby amplifying the dynamic and prominent areas throughout the frame. Mathematically,

$$F_{FO} = f \odot M_{FO}. \quad (3)$$

The disentangled object representation, denoted as $F_{FO}$, is obtained by element-wise multiplication (Hadamard product) of the activation maps $f$ and the dynamic mask $M_{FO}$, as shown in Equation 3.
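The NumPy sketch below illustrates our reading of Equations 1-3: a temporal FFT of the feature maps, a frequency-weighted squared amplitude that forms the dynamic mask $M_{FO}$, and an element-wise product with the features. The tensor layout, the exact frequency weighting, and the normalisation step are assumptions rather than the authors' released code.

```python
import numpy as np

def fourier_object_disentanglement(f):
    """f: feature maps of shape (C, T, H, W).
    Returns F_FO = f * M_FO, where M_FO weights temporal-frequency
    amplitudes by the squared frequency, emphasising moving regions."""
    C, T, H, W = f.shape

    # Eq. (1): 1D FFT along the temporal axis.
    F_t = np.fft.fft(f, axis=1)                       # (C, T, H, W), complex

    # Input-independent temporal frequencies 2*pi*k/N.
    freqs = 2 * np.pi * np.fft.fftfreq(T)             # (T,)

    # Our reading of Eq. (2): squared amplitude weighted by squared frequency,
    # summed over frequencies to give one mask value per (c, h, w) location.
    M_fo = np.sum((np.abs(F_t) ** 2) * (freqs ** 2)[None, :, None, None], axis=1)

    # Normalise the mask to [0, 1] for numerical stability (our choice).
    M_fo = M_fo / (M_fo.max() + 1e-8)

    # Eq. (3): element-wise (Hadamard) product, broadcast over time.
    return f * M_fo[:, None, :, :]
```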
E. Space-Time Fourier Attention
In some scenarios, the foreground and background are interconnected. Moreover, consecutive frames are temporally dependent in the action recognition setting. Although explicitly
modelling the relationships between individual pixels that
represent orientations, joint motions, and positions may be
unnecessary, it remains essential for the neural network to
recognize and learn these aspects autonomously. Space-time
self-attention has been proven effective in extracting such
knowledge for video action recognition. Several studies such
as [6][16] have explored the use of space-time self-attention
mechanisms to capture temporal and spatial dependencies in
video data. However, these approaches often involve compu-
tationally expensive matrix multiplications, which can limit
their practical applicability. Hence, it is important to consider
the computational cost associated with these approaches. By
leveraging the power of Fourier transformation, FA achieves
this approximation in a computationally efficient manner [17].
The self-attention mechanism relies on key, query, and value
vectors as input, which are derived from a shared input feature
map through $1 \times 1$ convolutions. According to Vaswani et al. [18], self-attention is computed as a weighted sum of the values, where the weight assigned to each value is determined by a compatibility function that evaluates how closely the query and the key match up. This
compatibility function plays a crucial role in determining the
relevance or importance of each value for the given query.
The key, query, and value components in the self-attention
mechanism are obtained through $1 \times 1$ convolution layers applied to the input feature maps. Mathematically, let $x$ denote the input feature maps, and let $\otimes$ represent matrix multiplication. The attention computation can be expressed as follows:

$$\text{Attention} = \text{Value}(x) \otimes [\text{Query}(x)^{T} \otimes \text{Key}(x)]^{T} \quad (4)$$
The space-time Fourier attention method operates in the
following manner. Initially, a representation analogous to the
key-query computation is obtained, referred to as the Fourier
sub-attention. The concept of Fourier sub-attention draws in-
spiration from autocorrelation, which quantifies the correlation
coefficient between distinct segments of a given signal.
Sub-attention in the Fourier domain involves taking the
element-wise product of the complex conjugate of the Fourier
transform of the feature maps with the original feature maps.
To obtain the space-time Fourier sub-attention, the video
feature maps (f) are translated to the frequency domain
through a 2D Fourier transform along the spatial and temporal
axes, resulting in a 3D representation $(C \times T' \times (HW))$. The transformation is expressed using the equation:

$$\mathcal{F}_{ST}(f)(m, n) = \sum_{h,w} f(c, t, h, w) \, e^{-2\pi i m h / M} \, e^{-2\pi i n w / N}, \quad (5)$$

where $m$ and $n$ represent the frequency indices, and $M$ and $N$ are the dimensions of the spatial and temporal axes, respectively. The Fast Fourier Transform (FFT) algorithm [15] is employed for efficient computation of the Fourier transform.
FFT allows for extensive global interactions between dis-
tinct temporal and spatial regions in the video by expressing
the signal in its entirety across a wide range of frequencies. Multiplying $\mathcal{F}_{ST}$ by its complex conjugate $\mathcal{F}_{ST}^{*}$ yields the space-time Fourier sub-attention $A_{ST}$ in the Fourier domain, as shown in Equation 6:

$$A_{ST} = \mathcal{F}_{ST} \times \mathcal{F}_{ST}^{*} \quad (6)$$
For obtaining the correlations in the time domain, we compute the inverse Fast Fourier Transform ($\mathcal{IF}$) of the space-time Fourier sub-attention $A_{ST}$. The resulting correlation maps are then reshaped to match the dimensions of the input feature maps $(C \times T' \times H' \times W')$.
The input feature maps are combined with the sub-attention weights using a dot product method. The final space-time Fourier attention maps $f_{FA}$ are computed using $f$. A scaling factor $\lambda_{FA}$ is empirically chosen to be 0.01 to scale the Fourier attention maps. The following equation describes the combination of the input feature maps and the scaled attention maps:

$$f_{FA} = f + \lambda_{FA} \times \mathcal{IF}(A_{ST}) \quad (7)$$
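A NumPy sketch of the space-time Fourier attention as we interpret Equations 5-7 is shown below: a 2D FFT of the feature maps, multiplication by the complex conjugate to form the sub-attention, an inverse FFT back to the original domain, and a residual combination scaled by $\lambda_{FA} = 0.01$. The tensor layout and the restriction of the transform to the spatial axes are assumptions; the authors' implementation may differ.

```python
import numpy as np

LAMBDA_FA = 0.01  # empirically chosen scaling factor reported in the paper

def fourier_space_time_attention(f, lam=LAMBDA_FA):
    """f: feature maps of shape (C, T, H, W).
    Returns f_FA = f + lam * IFFT(A_ST), with A_ST = F_ST * conj(F_ST)."""
    # Eq. (5): 2D FFT over the spatial axes (the paper describes a joint
    # space-time transform; we sketch the spatial part for clarity).
    F_st = np.fft.fft2(f, axes=(2, 3))                # (C, T, H, W), complex

    # Eq. (6): sub-attention as the product with the complex conjugate
    # (the power spectrum, analogous to autocorrelation in Fourier space).
    A_st = F_st * np.conj(F_st)

    # Eq. (7): inverse FFT back to the original domain (spatial shape is
    # preserved, so no explicit reshape is needed), then residual addition.
    attn = np.fft.ifft2(A_st, axes=(2, 3)).real
    return f + lam * attn
```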
F. Creating the End-to-End Model
In the proposed approach, a pre-trained C3D model is
utilized as a feature extractor, which has been trained on
a large-scale sports video dataset [19]. To ensure that the
pre-trained weights remain unchanged during the new task’s
training, the layers of the pre-trained model are set to be non-
trainable. The output of the pre-trained model then undergoes
a series of operations, including disentanglement and spatial
causality functions. These operations enhance the feature
representation by separating temporal and spatial information
and emphasizing spatially coherent patterns. To further im-
prove generalization and prevent overfitting, dropout layers are
added to the model. Subsequently, fully connected layers are
incorporated to map the enhanced features to the output classes
corresponding to the specific task. Finally, the end-to-end
model is compiled using an appropriate loss function, such as
sparse categorical cross-entropy, and an optimizer, commonly
Adam. Moreover, additional evaluation metrics, like accuracy,
can be specified to assess the model’s performance [19].
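The description above suggests a Keras-style assembly; a minimal sketch is given below. Here, build_mc3d_feature_extractor is the frozen backbone from Section III-B, while FourierObjectDisentanglement and FourierSpaceTimeAttention are hypothetical layer wrappers around the operations of Sections III-D and III-E. The dropout rate and hidden width are not given in the paper and are placeholders, not the authors' exact configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_fft_uavnet(num_classes=10, chunk_shape=(16, 112, 112, 3)):
    """End-to-end FFT-UAVNet sketch: frozen m-C3D features, FO and FA
    blocks, dropout, and fully connected layers with softmax output."""
    inputs = keras.Input(shape=chunk_shape)

    features = build_mc3d_feature_extractor()(inputs)   # frozen backbone (Sec. III-B)
    x = FourierObjectDisentanglement()(features)         # hypothetical layer, Sec. III-D
    x = FourierSpaceTimeAttention()(x)                    # hypothetical layer, Sec. III-E

    x = layers.Flatten()(x)
    x = layers.Dropout(0.5)(x)                            # rate assumed, not reported
    x = layers.Dense(256, activation="relu")(x)           # width assumed, not reported
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = keras.Model(inputs, outputs, name="FFT_UAVNet")
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```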
IV. EXPERIMENT
This experiment section presents a comprehensive analysis
and evaluation of the human action recognition system using
unmanned aerial vehicles (UAVs). The proposed methodology
was applied using the UAV-Human [2] dataset. With 119 subjects and 67,428 multi-modal video sequences for action identification, UAV-Human offers a benchmark for understanding human behaviour. The dataset contains 155 action recognition classes, making it suitable for recognising a wide variety of human actions.
For our surveillance-specific task, we made a strategic
decision to narrow down the dataset and focus on a subset
of classes that were particularly relevant to our research
objective. Specifically, we selected ten action classes that
were characterised by their association with violent actions.
These classes were chosen based on their similarities and
relevance to violence, as it is an important aspect to consider
in UAV-based surveillance scenarios. These action classes
were punching someone, kicking someone, pushing someone,
slapping someone on the back, holding someone hostage,
threatening someone with a knife, threatening someone with a
gun, dragging someone, calling for help, and stabbing someone
with a knife.
Here, the dataset consisted of a total of 10 action classes,
with each class containing 30 videos. To ensure an appropriate
division of the dataset for training and testing purposes, a split
ratio of 80:20 was employed.
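As an illustration, such an 80:20 split could be obtained with a stratified hold-out; the exact splitting procedure and random seed are not reported in the paper, so the snippet below is only a plausible sketch.

```python
from sklearn.model_selection import train_test_split

# chunks: preprocessed 16-frame clips; labels: integer class ids in [0, 9]
X_train, X_test, y_train, y_test = train_test_split(
    chunks, labels, test_size=0.20, stratify=labels, random_state=42)
```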
During the training phase, the dataset was preprocessed and features were extracted from it. For computational efficiency, the extracted features were stored using memory mapping (a minimal storage sketch is shown after the results below). After several training iterations, the proposed model was applied to the data subset, yielding the following findings:
Loss = 0.992
Top-1 Accuracy = 64.86%
Top-3 Accuracy = 83.37%
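A minimal sketch of the memory-mapped feature storage mentioned above is given here; the chunk count, feature size, file name, and helper functions are assumptions rather than the authors' actual pipeline, and extractor is taken to be the frozen m-C3D feature extractor sketched earlier.

```python
import numpy as np

N_CHUNKS, FEAT_DIM = 1000, 4096            # illustrative sizes only, not the paper's
features = np.memmap("features.dat", dtype="float32",
                     mode="w+", shape=(N_CHUNKS, FEAT_DIM))

for i, chunk in enumerate(chunk_iterator()):            # hypothetical chunk generator
    features[i] = extractor.predict(chunk[None])[0]     # write features straight to disk
features.flush()                                         # persist without holding all data in RAM
```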
The left portion of Fig 3 provides a visual representation
of the loss function during the training of the model for five
epochs. The plot shows how the loss function changes over
the course of these initial five epochs. With each successive
epoch, there is a sharp decrease in the loss of the model.
This decreasing trend suggests that the model’s performance
improves as it undergoes more epochs.
As the loss of the model decreased during the training
process, there was a noticeable increase in the overall accuracy
of the system, as demonstrated on the right side of Fig 3.

Fig. 3: On the left, the loss curve of the trained model; on the right, the accuracy curve of the proposed model

Fig. 4: ROC curve of the proposed model

The improvement in accuracy suggests that the model's
performance was enhanced as it learned from the training data
and iteratively updated its parameters to minimise the loss.
Furthermore, the validation set, used to evaluate the model’s
generalisation, exhibited even better results in detecting human
actions compared to the training set.
The Receiver Operating Characteristic (ROC) curve in Fig
4 involved 10 distinct classes. For a range of threshold values,
the ROC curve plots the true positive rate (sensitivity) against
the false positive rate (1-specificity). By plotting these values,
we gained insights into how well the model distinguished
the positive class from the negative classes, independently
for each of the ten classes. As this is a multi-class action
recognition problem, we used a One-vs-All (OvA) approach. To assess each class's area under the ROC curve, one class is treated as positive and the remaining classes as negative. Ultimately, a single graph is created by combining the ROC curves of all classes. Our
results revealed that the proposed model achieved excellent
discrimination performance for most classes, as evidenced by
the high area under the ROC curve (AUC) values consistently
above 0.8.
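A sketch of how the One-vs-All ROC and AUC values could be computed with scikit-learn is shown below; y_test (true labels) and y_score (per-class softmax probabilities on the test set) are assumed to come from the evaluation pipeline and are not part of the paper.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

n_classes = 10
y_bin = label_binarize(y_test, classes=np.arange(n_classes))  # one column per class

fpr, tpr, roc_auc = {}, {}, {}
for c in range(n_classes):
    # One-vs-All: class c is positive, all remaining classes are negative.
    fpr[c], tpr[c], _ = roc_curve(y_bin[:, c], y_score[:, c])
    roc_auc[c] = auc(fpr[c], tpr[c])
```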
To gain deeper insights into the model’s classification accu-
racy, we utilised a confusion matrix analysis. For our ten-class
problem, the confusion matrix in Fig 5 is a 10x10 matrix,
where every column represents the anticipated class, while
every row represents the actual class. The main diagonal of
the confusion matrix represents the true positive predictions for
each class, indicating the number of instances correctly classi-
fied for each class. For example, the value 80 in the eighth row
and eighth column of the confusion matrix indicates that the
model correctly classified 80 instances of class 8 (Call for help; the classes are numerically encoded for computational simplicity). Off-diagonal elements signify misclassifications.
For example, the value 8 in the top right corner of the
confusion matrix indicates that the model incorrectly classified
8 instances of class 0 (Punching someone) as class 9 (Stabbing someone with a knife).

Fig. 5: Confusion matrix of the trained model

The confusion matrix in Fig 5 illustrates that the model performed well overall, with a high percentage of true positives for all classes.
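For reference, the confusion matrix itself can be obtained from the predicted and true labels as sketched below; y_pred is assumed to be the argmax of the softmax outputs, and the row-normalisation to percentages follows our reading of the reported values.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_pred = np.argmax(y_score, axis=1)            # predicted class per test clip
cm = confusion_matrix(y_test, y_pred)          # rows: actual class, columns: predicted class
cm_percent = 100 * cm / cm.sum(axis=1, keepdims=True)  # row-normalised, as percentages
```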
The proposed model’s effectiveness was further assessed by
comparing it with two state-of-the-art models, namely C3D
[20] and X3D [5] without making any modifications to the
network architectures as can be seen in Table I. Two fully
connected layers at the end of both networks were used to
recognize actions. The same dataset was used to train and
assess the C3D and X3D models for the same set of classes
and the same number of epochs. The outcomes of the C3D and
X3D models were notably inferior to those of the suggested
model. Compared to X3D and C3D, the suggested model
outperformed both in terms of top-1 accuracy by 33.53 and 36.81 percentage points, respectively. FFT-UAVNet also outperformed X3D and C3D in terms of top-3 accuracy, by 37.77 and 24.72 percentage points, respectively.
TABLE I: Evaluation on C3D, X3D, and the proposed model

Method | Loss | Top-1 Accuracy | Top-3 Accuracy
C3D [20] | 2.145 | 28.05% | 58.65%
X3D [5] | 1.977 | 31.33% | 45.60%
Proposed Method (FFT-UAVNet) | 0.992 | 64.86% | 83.37%
This observation indicated that the proposed method demon-
strated an improvement in action recognition and detection
compared to the C3D and X3D models. It suggested that the
proposed model was more effective in accurately evaluating
and detecting various human actions within the given dataset.
V. CONCLUSION
We have developed a UAV-based system for human ac-
tion recognition in video data, aiming to improve precision
and speed in practical applications. Through extensive re-
search, we addressed limitations of traditional approaches and
achieved significant improvements using UAVs. Our frame-
work combined modified C3D algorithms with Fourier Action
Recognition to capture spatial and temporal information in
aerial videos. The deep learning architectures enabled precise
detection and classification of human actions. Experimen-
tal results validated the effectiveness and robustness of our
method, showing improved accuracy and efficiency compared
to conventional C3D and X3D approaches. The scalability and
adaptability of our framework make it suitable for various
applications, especially surveillance.
REFERENCES
[1] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new
model and the kinetics dataset,” in proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
[2] T. Li, J. Liu, W. Zhang, Y. Ni, W. Wang, and Z. Li, “Uav-human: A large
benchmark for human behavior understanding with unmanned aerial
vehicles,” in Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, 2021, pp. 16 266–16 275.
[3] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human
actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402,
2012.
[4] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “Hmdb: a
large video database for human motion recognition,” in 2011 Interna-
tional conference on computer vision. IEEE, 2011, pp. 2556–2563.
[5] C. Feichtenhofer, “X3d: Expanding architectures for efficient video
recognition,” in Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, 2020, pp. 203–213.
[6] G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all
you need for video understanding?” in ICML, vol. 2, no. 3, 2021, p. 4.
[7] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks
for human action recognition,” IEEE transactions on pattern analysis
and machine intelligence, vol. 35, no. 1, pp. 221–231, 2012.
[8] S. Shinde, A. Kothari, and V. Gupta, “Yolo based human action
recognition and localization,” Procedia computer science, vol. 133, pp.
831–838, 2018.
[9] H. Peng and A. Razi, “Fully autonomous uav-based action recognition system using aerial imagery,” in International symposium on visual computing. Springer, 2020, pp. 276–290.
[10] T. Ahmad, M. Cavazza, Y. Matsuo, and H. Prendinger, “Detecting human
actions in drone images using yolov5 and stochastic gradient boosting,”
Sensors, vol. 22, no. 18, p. 7020, 2022.
[11] M. Ding, N. Li, Z. Song, R. Zhang, X. Zhang, and H. Zhou,
“A lightweight action recognition method for unmanned-aerial-vehicle
video,” in 2020 IEEE 3rd International Conference on Electronics and
Communication Engineering (ICECE). IEEE, 2020, pp. 181–185.
[12] X. Wang, R. Xian, T. Guan, C. M. de Melo, S. M. Nogar, A. Bera, and
D. Manocha, “Aztr: Aerial video action recognition with auto zoom and
temporal reasoning,” arXiv preprint arXiv:2303.01589, 2023.
[13] D. Kothandaraman, T. Guan, X. Wang, S. Hu, M. Lin, and D. Manocha, “Far: Fourier aerial video recognition,” in European Conference on Computer Vision. Springer, 2022, pp. 657–676.
[14] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4489–4497.
[15] M. Frigo and S. G. Johnson, “Fftw: An adaptive software architecture for the fft,” in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181), vol. 3. IEEE, 1998, pp. 1381–1384.
[16] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu, “Video swin transformer,” arXiv preprint arXiv:2106.13230, 2021.
[17] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” in International conference on machine learning. PMLR, 2019, pp. 7354–7363.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
[19] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and
L. Fei-Fei, “Large-scale video classification with convolutional neural
networks,” in Proceedings of the IEEE conference on Computer Vision
and Pattern Recognition, 2014, pp. 1725–1732.
[20] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4489–4497.