All content in this area was uploaded by Ahsan Imran on Apr 10, 2024
2023 5th International Conference on Sustainable Technologies for
Industry 5.0 (STI), 9-10 December, Dhaka
FFT-UAVNet: FFT Based Human Action
Recognition for Drone Surveillance System
Abdul Monaf Chowdhury∗, Ahsan Imran∗, and Md Mehedi Hasan∗
∗Dept. of Robotics and Mechatronics Engineering, University of Dhaka, Bangladesh
Email: monafabdul15@gmail.com, ahsaanimraan@gmail.com, mmhasan@du.ac.bd
Abstract—Unmanned aerial vehicles (UAVs) have emerged
as a transformative technology for human action recognition,
providing a bird's-eye view and unlocking new possibilities for
precise and comprehensive support in surveillance systems. While
substantial advances in ground-based human action recogni-
tion have been achieved, the unique characteristics of UAV
footage present new challenges that require tailored solutions.
Specifically, the reduced scale of humans in aerial perspectives
necessitates the development of specialised models to accurately
recognize and interpret human actions. Our research focuses on
modifying the well-established C3D model and incorporating Fast
Fourier Transform (FFT)-based object disentanglement (FO) and
space-time attention (FA) mechanisms. By leveraging the power
of FFT, our model effectively disentangles the human actors
from the background and captures the spatio-temporal dynamics
of human actions in UAV footage, enhancing the discriminative
capabilities and enabling accurate action recognition. Through
extensive experimentation on a subset of the UAV-Human dataset,
our proposed FFT-UAVNet (m-C3D+FO&FA+FC) model demon-
strates remarkable improvements in performance. We achieve a
Top-1 accuracy of 64.86% and a Top-3 accuracy of 83.37%,
surpassing the results obtained by the standard C3D and X3D
methods, which achieve only a Top-1 accuracy of 28.05% and
31.33%, respectively. These findings underscore the efficacy of
our approach and emphasize the significance of the proposed
model for UAV datasets in maximizing the potential of UAV-
based human action recognition.
Index Terms—UAV, Human Action Recognition, Computer
Vision, Surveillance
I. INTRODUCTION
UAV-based action recognition has the potential to revolutionise
surveillance systems and improve the safety and security of
human lives. Widespread deployment of this technology could
substantially increase the rate at which criminal activity is
detected. By leveraging the
unique advantages of UAVs, such as their aerial perspective,
mobility, and action recognition capabilities, it is possible to
develop an effective and efficient methodology for recognizing
and classifying human actions in diverse scenarios, which is
necessary for Industry 5.0. Imagine, for instance, a scenario
like the one in Fig. 1, where UAVs monitor crowds in public
areas during major events. The system ensures the
safety of eventgoers by identifying and flagging questionable
activities (such as hitting someone, running through a crowd,
or inciting a disturbance) in real-time. Real-time human ac-
tion recognition will allow sophisticated systems to support
emergency response teams.
Fig. 1: UAV-based Human Action Recognition System
Human action recognition has been the focus of a significant
amount of research over the years, which has resulted in
substantial technological improvements. However, the majority
of this work relies on ground cameras, which produce static
videos. Methods that generate excellent results on ground-based
datasets degrade sharply when applied to footage acquired by
UAVs. For example, the state-of-the-art action recognition model
I3D [1] scored 98.0% and 80.9% on two prominent ground-based
action datasets, UCF-101 [3] and HMDB-51 [4], respectively, but
achieved only 23.86% and 16.8% on two aerial action recognition
datasets, UAV-Human [2] and UCF-Aerial, respectively.
To extract meaningful features
from the video data captured by the UAV, advanced computer
vision techniques are employed that have demonstrated strong
performance in extracting spatio-temporal features from video
data. Activity recognition has extensively employed deep
learning techniques [5][1][6]. Analyzing videos of scenes cap-
tured by UAV cameras [2] is considerably more challenging
979-8-3503-9431-3/23/$31.00 © 2023 IEEE
compared to recognizing activities in datasets from ground
cameras [1]. In addition to modifying traditional feature ex-
traction methods, we incorporate innovative techniques that
take advantage of the unique characteristics of UAV-based
data. For example, Fourier action recognition is utilised to
encode long-range spatio-temporal correlations and automati-
cally separate individuals from the background. By exploit-
ing the convolution-multiplication properties of the Fourier
transform, we can effectively represent and analyse the object-
background entangled characteristics observed from the UAV
perspective. The extracted features are then utilised to train
and evaluate action recognition models. Careful attention has
been given to ensure the proposed approach’s dependability
and robustness. This paper’s contributions are as follows:
• A dependable system has been developed that can identify
  human actions in a variety of demanding environments
  employing UAV in a way that is stable, scalable, and
  flexible for surveillance purposes.
• The modified 3D convolutional Neural Network has been
  developed by reducing parameters and empowering it
  with Fast Fourier Transform modules which transform
  high-level features in the frequency domain and perform
  human object disentanglement as well as find temporal
  dependency in consecutive frames.
• A higher degree of accuracy has been achieved in the
  identification of human activity in different actions.
II. RELATED WORKS
Aerial video activity detection is a difficult task because
of small target sizes, camera mobility, and scarce datasets.
The use of deep learning has yielded encouraging results
in this sector over the years, and numerous methods have
been presented to handle the special issues posed by UAV-based
recordings. Ji et al. developed a ground-breaking 3D
Convolutional Neural Network (CNN) model for human action
recognition [7]. This model extracts motion information from
successive video frames, capturing the temporal dimension in
addition to the spatial dimension. On conventional datasets,
the model produced state-of-the-art results, outperforming
earlier techniques using 2D CNNs or handmade features.
However, it may struggle with occlusions and necessitates
a significant amount of data for training. Shinde et al.
used the YOLO object detection algorithm for human action
recognition and localization [8]. They modified YOLO to
predict class probabilities and bounding boxes for various
actions. The model demonstrated real-time implementation
and competitive performance on the Liris Human Activities
dataset. However, it may face challenges in handling com-
plex backgrounds and recognizing subtle actions. Peng et al.
proposed an automated action recognition system based on
deep learning for UAV aerial imagery [9]. The components
of the system are action recognition, video stabilization, and
human action area detection. They introduced an updated
version of InceptionResNet-v2 called Inception-ResNet-3D for
action recognition, achieving high accuracy on the UCF-ARG
dataset. However, the system may face challenges with fast
drone movements and poor lighting conditions. Ahmad et
al. combined YoloV5 with stochastic gradient boosting for
detecting human actions in drone videos [10]. Their hybrid
approach achieved real-time implementation and competitive
performance on the Okutama-Action dataset. However, it may
have limitations in detecting actions in complex backgrounds
and recognizing slow or subtle actions. Ding et al. presented
a lightweight action recognition framework called LARMUV,
using MobileNetV3 as the feature extraction network [11].
They introduced self-attention for capturing temporal struc-
ture and used focal loss for better optimization. LARMUV
achieved real-time implementation and competitive perfor-
mance on standard datasets. However, it may struggle with
recognizing actions of different sizes and detecting fast or
subtle actions. Wang et al. introduced AZTR, a method for
aerial video action recognition that utilizes temporal reasoning
and auto zoom [12]. The auto zoom technique effectively
isolates the human actor from the background, while tem-
poral reasoning captures long-range space-time dependencies.
AZTR achieved significant improvements in accuracy and real-
time performance on various UAV datasets. However, it may
have limitations in handling complex localization scenarios.
Kothandaraman et al. proposed Fourier Aerial Video Recognition (FAR),
a technique for detecting human actions in UAV recordings
[13]. FAR employs Fourier object disentanglement to isolate
the human agent from the backdrop and Fourier attention for
long-range space-time reasoning. FAR achieved substantial
improvements in accuracy and computational efficiency on
multiple UAV datasets. However, it may have limitations in
handling multiple actions in a single frame.
III. METHODOLOGY
Our solution focuses on human-action recognition on un-
manned aerial vehicle datasets. The proposed model is a
combination of a modified C3D (m-C3D) deep learning model,
Fast Fourier Transform (FFT) in the frequency domain and
Fully Connected layers. The proposed model’s detailed archi-
tecture is shown in Figure 2.
A. The Preprocessing Block
The preprocessing block extracts samples from video files and
prepares them for further processing. It resizes the video
frames to a standardized size (112×112) and groups them into
sequences of frames, also known as chunks, to capture the
temporal information in videos. Each chunk consists of 16
consecutive frames.
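The preprocessing step can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the nearest-neighbour resize, the non-overlapping chunking, the dropping of an incomplete final chunk, and the function names `resize_nearest` and `make_chunks` are not specified in the paper and are chosen here for simplicity.

```python
import numpy as np

def resize_nearest(frame, size=112):
    """Nearest-neighbour resize of an (H, W, 3) frame to (size, size, 3)."""
    h, w = frame.shape[:2]
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    return frame[rows][:, cols]

def make_chunks(frames, chunk_len=16, size=112):
    """Resize every frame, then group the frames into non-overlapping chunks."""
    resized = np.stack([resize_nearest(f, size) for f in frames])
    n_chunks = len(resized) // chunk_len          # incomplete tail is dropped (assumption)
    usable = resized[: n_chunks * chunk_len]
    return usable.reshape(n_chunks, chunk_len, size, size, 3)

# Synthetic stand-in for a decoded video: 40 frames of 240x320 RGB.
video = np.random.randint(0, 256, size=(40, 240, 320, 3), dtype=np.uint8)
chunks = make_chunks(video)
print(chunks.shape)  # (2, 16, 112, 112, 3)
```

Each element of `chunks` then has the 16×112×112×3 shape that a C3D-style network expects as input.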
B. Modified Convolutional 3D (m-C3D) Model
The C3D model architecture is a well-established approach
for video analysis and classification. It uses 3D convolutions
to obtain both temporal and spatial information. The model
consists of convolutional layers followed by max pooling
operations, organized into several layer groups. After the
convolutional layers, two fully connected layers are used with
dropout regularization to prevent overfitting. The final output
layer generates class probabilities for video clip classification.
In this work, the m-C3D model utilizes the “fc6” layer
to capture more general features, preventing overfitting and
promoting transferability and generalization across tasks and
datasets. This modified version of the conventional 3D CNN
model performs well on the UAV dataset, as validated in the
experiment section. By leveraging the learned representations
from a pre-trained C3D model [14], the model’s performance is
improved in specific classification tasks.
Fig. 2: Proposed architecture. A video passes through the Preprocessing Block (resized 16-frame chunks), the m-C3D
feature extraction block (sky-blue), the Fourier Object Disentanglement Block (green), the Fourier Space-Time Attention
Block (red), and a Fully Connected Layer followed by a 10-unit dense layer with softmax activation, which outputs the
classification label.
C. Fourier Disentangled Space Time Attention
This section focuses on the module utilized for decoding
human actors’ actions and encoding contextual information.
Fourier Object Disentanglement (FO) automatically separates
the object from the background, while Fourier Space-Time
Attention (FA) incorporates self-attention properties to capture
extensive spatial and temporal relationships at a reduced
computational burden.
D. Fourier Object Disentanglement
In this research, we have used the Fourier Object Disentan-
glement (FO) approach, which effectively addresses the task
of automatically isolating the human agent from the surround-
ings in surveillance scenarios. The movement of the humans
within a scene can be effectively captured by examining the
temporal variations in the feature maps that encode the spatial
information of the video frames across the dimensions of the
scene (H×W).
In order to detect and characterize movement, we begin
by transforming the feature maps into a temporal frequency
space. This transformation allows us to examine the signal’s
behaviour across different temporal frequencies and extract
valuable information regarding the presence of movement in
the scene.
In our approach, we utilize a 1D Fourier transform along the
temporal dimension to perform the necessary computations.
Let the feature maps on which the Fourier Object
Disentanglement (FO) method is applied be represented by
$f(c, t, h, w) \in C \times T_0 \times H_0 \times W_0$. Here,
$(H_0 \times W_0)$ and $T_0$ denote the spatial and temporal
dimensions of the feature maps, while $C$ denotes the number of
channels. The temporal Fourier transform’s amplitude at a
particular frequency, denoted by $-2\pi k/N$, is calculated as
follows:

$$F_T(f)(k) = \sum_{n=0}^{T_0} f(c, t, h, w) \times e^{-2\pi k n / N}. \qquad (1)$$

To efficiently compute this transform, we employ the Fast
Fourier Transform (FFT) algorithm [15], which provides an
optimized solution for this task.
For each spatial and channel location in the feature map
f, the amplitude of the temporal signal is captured mathe-
matically as FT(f)(k). In simple terms, higher frequencies in
the temporal dimension correspond to movement, while lower
frequencies indicate static regions in the scene. Consequently,
areas associated with the human actor’s motion should exhibit
higher amplitudes in the Fourier transform at higher frequen-
cies.
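The link between temporal frequency and motion can be illustrated with a toy example. The snippet below is a sketch rather than the authors' code: the two-pixel feature map, its values, and the variable names are assumptions made for illustration. Note that `np.fft.fft` computes the standard discrete Fourier transform, whose exponent includes the imaginary unit.

```python
import numpy as np

T = 16
t = np.arange(T)

# Toy feature map of shape (C, T, H, W) = (1, 16, 1, 2).
f = np.zeros((1, T, 1, 2))
f[0, :, 0, 0] = 0.8                                        # static pixel: constant over time
f[0, :, 0, 1] = 0.8 + 0.5 * np.cos(2 * np.pi * 3 * t / T)  # moving pixel: 3 cycles per chunk

# 1D FFT along the temporal axis (axis=1), then take the amplitude.
amplitude = np.abs(np.fft.fft(f, axis=1))

# Energy in all bins except the zero-frequency (DC) bin.
non_dc_energy = np.sum(amplitude[:, 1:] ** 2, axis=1)
print(non_dc_energy[0, 0])  # first entry ~0 (static), second entry clearly larger (moving)
```

The static pixel carries all of its energy in the zero-frequency bin, whereas the oscillating pixel shows a clear peak at a non-zero temporal frequency.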
It is important to note that the frequencies utilized in the
Fourier Object Disentanglement (FO) technique are independent
of the input video. Consequently, we can express the
dynamic mask $M_{FO}$ as:

$$M_{FO} = \lVert F_T(f)(k) \rVert_2^2 \times \lVert f_{rk} \rVert_2^2 \qquad (2)$$

where $\lVert a \rVert_2^2$ denotes the squared L2-norm of a
vector $a$. The dynamic mask $M_{FO}$ serves to disentangle or
amplify the regions in the scene that correspond to moving pixels.
It is important to note that these regions may include both
the moving background (including camera motion) and the
moving human actor. Our subsequent task involves using
$M_{FO}$ to distinguish the shifting object pixels from the moving
background pixels.
We use the model’s activation maps $f$ to isolate the moving
actor. Although not flawless, these activations tend to be higher
in salient regions of the scene compared to non-salient regions.
As a result, the final representation of the disentangled object,
denoted as $F_{FO}$, can be obtained by element-wise
multiplication (Hadamard product, $\odot$) of the network
features $f$ and the dynamic mask $M_{FO}$, thereby amplifying
the dynamic and prominent areas throughout the frame.
Mathematically,

$$F_{FO} = f \odot M_{FO}. \qquad (3)$$
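A simplified version of the masking step can be sketched as follows. This is a hedged illustration, not the authors' implementation: the mask here uses only the temporal-spectrum energy (the first norm in Eq. (2)), it excludes the zero-frequency bin, and it is rescaled to [0, 1]. These three choices, together with the function name, are assumptions introduced for the example.

```python
import numpy as np

def fourier_object_disentanglement(f):
    """f: feature maps of shape (C, T, H, W). Returns (F_FO, M_FO)."""
    spectrum = np.fft.fft(f, axis=1)                 # 1D FFT along the temporal axis
    energy = np.abs(spectrum[:, 1:]) ** 2            # drop the DC bin (assumption)
    mask = energy.sum(axis=1, keepdims=True)         # squared L2-norm over frequencies
    mask = mask / (mask.max() + 1e-8)                # rescale to [0, 1] (assumption)
    # The second norm factor of Eq. (2) is omitted in this sketch.
    return f * mask, mask                            # Hadamard product, broadcast over T

rng = np.random.default_rng(0)
f = np.ones((4, 16, 7, 7))                           # static, uniformly active scene
f[:, :, 2:4, 2:4] += rng.normal(size=(4, 16, 2, 2))  # small region that changes over time

F_FO, M_FO = fourier_object_disentanglement(f)
print(F_FO.shape, M_FO.shape)  # (4, 16, 7, 7) (4, 1, 7, 7)
```

In this toy input the mask is close to zero everywhere except the 2×2 region whose activations vary over time, so the product suppresses the static background.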
E. Space-Time Fourier Attention
In some scenarios foreground and background images are
interconnected. Also, consecutive frames are dependent tem-
porally in the action recognition scenario. Although explicitly
modelling the relationships between individual pixels that
represent orientations, joint motions, and positions may be
unnecessary, it remains essential for the neural network to
recognize and learn these aspects autonomously. Space-time
self-attention has been proven effective in extracting such
knowledge for video action recognition. Several studies such
as [6][16] have explored the use of space-time self-attention
mechanisms to capture temporal and spatial dependencies in
video data. However, these approaches often involve compu-
tationally expensive matrix multiplications, which can limit
their practical applicability. By leveraging the Fourier
transform, FA approximates space-time self-attention in a
computationally efficient manner [17].
The self-attention mechanism relies on key, query, and value
vectors as input, which are derived from a shared input feature
map through $1 \times 1$ convolutions. According to Vaswani et
al. [18], self-attention is computed as a weighted sum of the
values, where the weight assigned to each value is given by a
compatibility function that evaluates how closely the query
matches the corresponding key. This compatibility function
plays a crucial role in determining the relevance or importance
of each value for the given query. Mathematically, let $x$
denote the input feature maps, and let $\otimes$ represent
matrix multiplication. The attention computation can be
expressed as follows:

$$\text{Attention} = \text{Value}(x) \otimes [\text{Query}(x)^T \otimes \text{Key}(x)]^T \qquad (4)$$
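The cost of Eq. (4) becomes visible when it is written out with explicit shapes. The following NumPy sketch makes several assumptions: the feature maps are flattened to C×N with N = T·H·W, the 1×1 convolutions are replaced by random matrices standing in for learned weights, and no softmax or scaling is applied because Eq. (4) as written contains none.

```python
import numpy as np

rng = np.random.default_rng(0)
C, T, H, W = 8, 4, 5, 5
N = T * H * W                        # number of space-time positions
x = rng.normal(size=(C, N))          # flattened input feature maps

# A 1x1 convolution mixes channels only, so it reduces to a matrix product.
# These random weights are stand-ins for learned parameters.
W_q = rng.normal(size=(C // 2, C))
W_k = rng.normal(size=(C // 2, C))
W_v = rng.normal(size=(C, C))
query, key, value = W_q @ x, W_k @ x, W_v @ x

scores = query.T @ key               # (N, N): one entry per pair of positions
attention = value @ scores.T         # Eq. (4): Value (x) [Query^T (x) Key]^T
print(scores.shape, attention.shape) # (100, 100) (8, 100)
```

The N×N score matrix grows quadratically with the number of space-time positions, which is the expense that the Fourier formulation aims to avoid.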
The space-time Fourier attention method operates in the
following manner. Initially, a representation analogous to the
key-query computation is obtained, referred to as the Fourier
sub-attention. The concept of Fourier sub-attention draws in-
spiration from autocorrelation, which quantifies the correlation
coefficient between distinct segments of a given signal.
Sub-attention in the Fourier domain involves taking the
element-wise product of the Fourier transform of the feature
maps with its complex conjugate.
To obtain the space-time Fourier sub-attention, the video
feature maps ($f$) are translated to the frequency domain
through a 2D Fourier transform along the spatial and temporal
axes, resulting in a 3D representation ($C \times T_0 \times (HW)$).
The transformation is expressed using the equation:

$$F_{ST}(f)(m, n) = \sum_{h, w} f(c, t, h, w)\, e^{-2\pi m h / M}\, e^{-2\pi n w / N}, \qquad (5)$$

where $m$ and $n$ represent the frequency indices, and $M$ and
$N$ are the dimensions of the spatial and temporal axes,
respectively. The Fast Fourier Transform (FFT) algorithm [15] is
employed for efficient computation of the Fourier transform.
FFT allows for extensive global interactions between distinct
temporal and spatial regions in the video by expressing
the signal in its entirety across a wide range of frequencies.
Multiplying $F_{ST}$ by its complex conjugate $F_{ST}^{*}$ yields
the space-time Fourier sub-attention $A_{ST}$ in the Fourier
domain, as shown in Equation 6:

$$A_{ST} = F_{ST} \times F_{ST}^{*} \qquad (6)$$
To obtain the correlations in the time domain, we compute
the inverse Fast Fourier Transform ($IF$) of the space-time
Fourier sub-attention $A_{ST}$. The resulting correlation maps are
then reshaped to match the dimensions of the input feature
maps ($C \times T_0 \times H_0 \times W_0$).
The input feature maps are combined with the sub-attention
weights using a dot product method. The final space-time Fourier
attention maps $f_{FA}$ are computed using $f$. A scaling factor
$\lambda_{FA}$, empirically chosen to be 0.01, scales the Fourier
attention maps. The following equation describes the combination
of the input feature maps and the scaled attention maps:

$$f_{FA} = F + \lambda_{FA} \times IF(A_{ST}) \qquad (7)$$
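Equations (5) to (7) can be sketched with NumPy's FFT routines. The sketch rests on assumptions that the text does not settle: the spatial axes are flattened into a single axis of length HW before the 2D FFT, the base term written as F in Eq. (7) is taken to be the input feature maps f, and the real part of the inverse transform is kept. The function name is hypothetical.

```python
import numpy as np

def fourier_space_time_attention(f, lam=0.01):
    """f: real-valued feature maps of shape (C, T, H, W)."""
    C, T, H, W = f.shape
    f3d = f.reshape(C, T, H * W)                     # 3D representation C x T x (HW)
    F_st = np.fft.fft2(f3d, axes=(1, 2))             # 2D FFT over time and flattened space
    A_st = F_st * np.conj(F_st)                      # Eq. (6): Fourier sub-attention
    corr = np.fft.ifft2(A_st, axes=(1, 2)).real      # inverse FFT gives correlation maps
    corr = corr.reshape(C, T, H, W)                  # back to the input dimensions
    return f + lam * corr                            # Eq. (7), with f as the base term

rng = np.random.default_rng(0)
f = rng.normal(size=(2, 4, 3, 3))
f_fa = fourier_space_time_attention(f)
print(f_fa.shape)  # (2, 4, 3, 3)
```

Because the inverse transform of a power spectrum is the circular autocorrelation of the signal, this computation captures correlations between all pairs of space-time positions without forming an explicit pairwise matrix.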
F. Creating the End-to-End Model
In the proposed approach, a pre-trained C3D model is
utilized as a feature extractor, which has been trained on
a large-scale sports video dataset [19]. To ensure that the
pre-trained weights remain unchanged during the new task’s
training, the layers of the pre-trained model are set to be non-
trainable. The output of the pre-trained model then undergoes
a series of operations, including disentanglement and spatial
causality functions. These operations enhance the feature
representation by separating temporal and spatial information
and emphasizing spatially coherent patterns. To further im-
prove generalization and prevent overfitting, dropout layers are
added to the model. Subsequently, fully connected layers are
incorporated to map the enhanced features to the output classes
corresponding to the specific task. Finally, the end-to-end
model is compiled using an appropriate loss function, such as
sparse categorical cross-entropy, and an optimizer, commonly
Adam. Moreover, additional evaluation metrics, like accuracy,
can be specified to assess the model’s performance [19].
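The loss used at the end of this pipeline can be illustrated without a deep learning framework. The NumPy sketch below shows only the classification head and the sparse categorical cross-entropy; the feature size of 32, the random weights, and the example labels are assumptions, and the frozen backbone is replaced by random features.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)   # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability of the integer class labels."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 32))       # stand-in for frozen backbone features
W = rng.normal(size=(32, 10)) * 0.1       # dense layer weights for 10 classes
b = np.zeros(10)
labels = np.array([0, 3, 9, 3, 7])        # integer labels, not one-hot vectors

probs = softmax(features @ W + b)
loss = sparse_categorical_crossentropy(probs, labels)
print(probs.shape, loss)
```

The "sparse" variant takes integer class indices directly, which avoids building one-hot label vectors.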
IV. EXPERIMENT
This section presents a comprehensive analysis
and evaluation of the human action recognition system using
unmanned aerial vehicles (UAVs). The proposed methodology
was applied to the UAV-Human dataset [2]. With 119
people and 67,428 multi-modal video sequences for action
recognition, UAV-Human offers a benchmark for understanding
human behaviour. The dataset contains 155 action classes,
making it suitable for recognising a wide variety of human
actions.
For our surveillance-specific task, we made a strategic
decision to narrow down the dataset and focus on a subset
of classes that were particularly relevant to our research
objective. Specifically, we selected ten action classes that
were characterised by their association with violent actions.
These classes were chosen based on their similarities and
relevance to violence, as it is an important aspect to consider
in UAV-based surveillance scenarios. These action classes
were punching someone, kicking someone, pushing someone,
slapping someone on the back, holding someone hostage,
threatening someone with a knife, threatening someone with a
gun, dragging someone, calling for help, and stabbing someone
with a knife.
Here, the dataset consisted of a total of 10 action classes,
with each class containing 30 videos. To ensure an appropriate
division of the dataset for training and testing purposes, a split
ratio of 80:20 was employed.
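With 30 videos in each of 10 classes, an 80:20 split yields 240 training and 60 test videos. The sketch below assumes a per-class (stratified) split at the video level, which the text does not state explicitly; the video identifiers are hypothetical placeholders.

```python
import random

CLASSES = list(range(10))      # ten selected action classes
VIDEOS_PER_CLASS = 30

def stratified_split(train_fraction=0.8, seed=0):
    """Split each class separately so both sets keep the class balance."""
    rng = random.Random(seed)
    train, test = [], []
    for label in CLASSES:
        # Hypothetical video identifiers; the real file names are not given.
        videos = [(f"class{label}_video{i}", label) for i in range(VIDEOS_PER_CLASS)]
        rng.shuffle(videos)
        cut = round(train_fraction * len(videos))
        train += videos[:cut]
        test += videos[cut:]
    return train, test

train, test = stratified_split()
print(len(train), len(test))  # 240 60
```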
During the training phase, the dataset was preprocessed and
features were extracted from it. For computational efficiency,
the extracted features were stored using memory mapping.
After several iterations of training, the proposed model was
applied to the data subset, yielding the following results:
• Loss = 0.992
• Top-1 Accuracy = 64.86%
• Top-3 Accuracy = 83.37%
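For reference, Top-k accuracy counts a prediction as correct when the true label is among the k highest-scoring classes. A minimal sketch follows; the scores and labels are toy values and are not taken from the experiment.

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    top_k = np.argsort(scores, axis=1)[:, -k:]       # indices of the k largest scores
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())

scores = np.array([
    [0.1, 0.6, 0.3],    # true class 1 has the highest score: top-1 hit
    [0.5, 0.2, 0.3],    # true class 2 is second best: hit only for k >= 2
    [0.2, 0.3, 0.5],    # true class 0 has the lowest score: hit only for k = 3
])
labels = np.array([1, 2, 0])

for k in (1, 2, 3):
    print(k, top_k_accuracy(scores, labels, k))
```

Top-k accuracy can never decrease as k grows, which is why the reported Top-3 figure is higher than the Top-1 figure.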
The left portion of Fig 3 provides a visual representation
of the loss function during the training of the model for five
epochs. The plot shows how the loss function changes over
the course of these initial five epochs. With each successive
epoch, there is a sharp decrease in the loss of the model.
This decreasing trend suggests that the model’s performance
improves as it undergoes more epochs.
As the loss of the model decreased during the training
process, there was a noticeable increase in the overall accuracy
of the system, as demonstrated on the right side of Fig. 3.
The improvement in accuracy suggests that the model’s
performance was enhanced as it learned from the training data
and iteratively updated its parameters to minimise the loss.
Furthermore, the validation set, used to evaluate the model’s
generalisation, exhibited even better results in detecting human
actions compared to the training set.
Fig. 3: Loss curve of the trained model (left) and accuracy
curve of the proposed model (right)
Fig. 4: ROC curve of the proposed model
The Receiver Operating Characteristic (ROC) curve in Fig. 4
covers the 10 distinct classes. For a range of threshold values,
the ROC curve plots the true positive rate (sensitivity) against
the false positive rate (1-specificity). By plotting these values,
we gained insights into how well the model distinguished
the positive class from the negative classes, independently
for each of the ten classes. As this is a multi-class action
recognition problem, we used a One-vs-All (OvA) approach.
To assess each class’s area under the ROC curve, that class is
set as positive and the others are set as negative. Finally, a
single graph is created by combining the ROC curves of all
classes. Our results revealed that the proposed model achieved
excellent discrimination performance for most classes, as
evidenced by area under the ROC curve (AUC) values
consistently above 0.8.
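The One-vs-All AUC for a class can be computed directly from its rank interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below uses toy scores for three classes; the numbers are illustrative and unrelated to the reported results.

```python
import numpy as np

def one_vs_all_auc(scores, labels, positive_class):
    """AUC with one class treated as positive and all other classes as negative."""
    pos = scores[labels == positive_class, positive_class]
    neg = scores[labels != positive_class, positive_class]
    # Count positive/negative pairs ranked correctly; ties count as half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

scores = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.3, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
])
labels = np.array([0, 0, 1, 1, 2, 2])

aucs = [one_vs_all_auc(scores, labels, c) for c in range(3)]
print(aucs)  # [1.0, 0.875, 1.0]
```

Class 1 scores below 1.0 because one of its samples ties with two negatives on the class-1 score.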
To gain deeper insights into the model’s classification
accuracy, we utilised a confusion matrix analysis. For our
ten-class problem, the confusion matrix in Fig. 5 is a 10×10
matrix, where every column represents the predicted class, while
every row represents the actual class. The main diagonal of
the confusion matrix represents the true positive predictions for
each class, indicating the number of instances correctly
classified for each class. For example, the value 80 in the row
and column of class 8 (Calling for help; the classes are numbered
from 0 to 9 for computational simplicity) indicates that the
model correctly classified 80 instances of that class.
Off-diagonal elements signify misclassifications.
For example, the value 8 in the top right corner of the
confusion matrix indicates that the model incorrectly classified
8 instances of class 0 (Punching someone) as class 9 (Stabbing
someone with a knife). The confusion matrix in Fig. 5
illustrates that the model performed well overall, with a high
percentage of true positives for all classes.
Fig. 5: Confusion matrix of the trained model
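The row and column convention described above (rows for the actual class, columns for the predicted class) can be reproduced with a short NumPy sketch. The three-class labels below are toy data chosen for illustration.

```python
import numpy as np

def confusion_matrix(actual, predicted, n_classes):
    """Rows index the actual class, columns index the predicted class."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(matrix, (actual, predicted), 1)   # count each (actual, predicted) pair
    return matrix

actual    = np.array([0, 0, 0, 1, 1, 2, 2, 2])
predicted = np.array([0, 0, 2, 1, 1, 2, 0, 2])

cm = confusion_matrix(actual, predicted, n_classes=3)
print(cm)
# [[2 0 1]
#  [0 2 0]
#  [1 0 2]]

# Diagonal entries are correct predictions; dividing by the row sums gives per-class recall.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(per_class_recall)
```

Here the entry in row 0, column 2 records one sample of class 0 that was predicted as class 2.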
The proposed model’s effectiveness was further assessed by
comparing it with two state-of-the-art models, namely C3D
[20] and X3D [5], without making any modifications to their
network architectures; the results are given in Table I. Two
fully connected layers at the end of both networks were used to
recognize actions. The same dataset was used to train and
assess the C3D and X3D models for the same set of classes
and the same number of epochs. The outcomes of the C3D and
X3D models were notably inferior to those of the proposed
model. Compared to X3D and C3D, the proposed model
achieved a higher top-1 accuracy by 33.53 and 36.81 percentage
points, respectively. FFT-UAVNet also outperformed X3D and
C3D in terms of top-3 accuracy, by 37.77 and 24.72 percentage
points, respectively.
TABLE I: Evaluation of C3D, X3D, and the proposed model
Method                          Loss     Top-1 Accuracy   Top-3 Accuracy
C3D [20]                        2.145    28.05%           58.65%
X3D [5]                         1.977    31.33%           45.60%
Proposed Method (FFT-UAVNet)    0.992    64.86%           83.37%
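The margins over each baseline follow directly from Table I. The short sketch below recomputes them in percentage points; the dictionary layout and the function name are illustrative choices.

```python
# Accuracies (%) copied from Table I.
results = {
    "C3D": {"top1": 28.05, "top3": 58.65},
    "X3D": {"top1": 31.33, "top3": 45.60},
    "FFT-UAVNet": {"top1": 64.86, "top3": 83.37},
}

def margin(metric, baseline):
    """Gain of FFT-UAVNet over a baseline, in percentage points."""
    return round(results["FFT-UAVNet"][metric] - results[baseline][metric], 2)

for baseline in ("C3D", "X3D"):
    print(baseline, margin("top1", baseline), margin("top3", baseline))
# C3D 36.81 24.72
# X3D 33.53 37.77
```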
This observation indicated that the proposed method demon-
strated an improvement in action recognition and detection
compared to the C3D and X3D models. It suggested that the
proposed model was more effective in accurately evaluating
and detecting various human actions within the given dataset.
V. CONCLUSION
We have developed a UAV-based system for human ac-
tion recognition in video data, aiming to improve precision
and speed in practical applications. Through extensive re-
search, we addressed limitations of traditional approaches and
achieved significant improvements using UAVs. Our frame-
work combined modified C3D algorithms with Fourier Action
Recognition to capture spatial and temporal information in
aerial videos. The deep learning architectures enabled precise
detection and classification of human actions. Experimen-
tal results validated the effectiveness and robustness of our
method, showing improved accuracy and efficiency compared
to conventional C3D and X3D approaches. The scalability and
adaptability of our framework make it suitable for various
applications, especially surveillance.
REFERENCES
[1] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new
model and the kinetics dataset,” in proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
[2] T. Li, J. Liu, W. Zhang, Y. Ni, W. Wang, and Z. Li, “Uav-human: A large
benchmark for human behavior understanding with unmanned aerial
vehicles,” in Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, 2021, pp. 16 266–16 275.
[3] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human
actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402,
2012.
[4] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “Hmdb: a
large video database for human motion recognition,” in 2011 Interna-
tional conference on computer vision. IEEE, 2011, pp. 2556–2563.
[5] C. Feichtenhofer, “X3d: Expanding architectures for efficient video
recognition,” in Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, 2020, pp. 203–213.
[6] G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all
you need for video understanding?” in ICML, vol. 2, no. 3, 2021, p. 4.
[7] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks
for human action recognition,” IEEE transactions on pattern analysis
and machine intelligence, vol. 35, no. 1, pp. 221–231, 2012.
[8] S. Shinde, A. Kothari, and V. Gupta, “Yolo based human action
recognition and localization,” Procedia computer science, vol. 133, pp.
831–838, 2018.
[9] H. Peng and A. Razi, “Fully autonomous uav-based action recognition
system using aerial imagery,” in International symposium on visual
computing. Springer, 2020, pp. 276–290.
[10] T. Ahmad, M. Cavazza, Y. Matsuo, and H. Prendinger, “Detecting human
actions in drone images using yolov5 and stochastic gradient boosting,”
Sensors, vol. 22, no. 18, p. 7020, 2022.
[11] M. Ding, N. Li, Z. Song, R. Zhang, X. Zhang, and H. Zhou,
“A lightweight action recognition method for unmanned-aerial-vehicle
video,” in 2020 IEEE 3rd International Conference on Electronics and
Communication Engineering (ICECE). IEEE, 2020, pp. 181–185.
[12] X. Wang, R. Xian, T. Guan, C. M. de Melo, S. M. Nogar, A. Bera, and
D. Manocha, “Aztr: Aerial video action recognition with auto zoom and
temporal reasoning,” arXiv preprint arXiv:2303.01589, 2023.
[13] D. Kothandaraman, T. Guan, X. Wang, S. Hu, M. Lin, and D. Manocha,
“Far: Fourier aerial video recognition,” in European Conference on
Computer Vision. Springer, 2022, pp. 657–676.
[14] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning
spatiotemporal features with 3d convolutional networks,” in Proceedings
of the IEEE international conference on computer vision, 2015, pp.
4489–4497.
[15] M. Frigo and S. G. Johnson, “Fftw: An adaptive software architecture
for the fft,” in Proceedings of the 1998 IEEE International Conference
on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No.
98CH36181), vol. 3. IEEE, 1998, pp. 1381–1384.
[16] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu, “Video
swin transformer,” arXiv preprint arXiv:2106.13230, 2021.
[17] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention
generative adversarial networks,” in International conference on machine
learning. PMLR, 2019, pp. 7354–7363.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances
in neural information processing systems, 2017, pp. 5998–6008.
[19] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and
L. Fei-Fei, “Large-scale video classification with convolutional neural
networks,” in Proceedings of the IEEE conference on Computer Vision
and Pattern Recognition, 2014, pp. 1725–1732.
[20] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning
spatiotemporal features with 3d convolutional networks,” in Proceedings
of the IEEE international conference on computer vision, 2015, pp.
4489–4497.