Challenges and Limitations in Human Action Recognition on Unmanned Aerial Vehicles: A Comprehensive Survey

October 2021
Traitement du signal 38(5):1403-1411

Nashwan Adnan Othman1,2*, Ilhan Aydin2

1 Department of Computer Science, College of Science, Knowledge University, Erbil 44001, Iraq

2 Department of Computer Engineering, Firat University, Elazig 23200, Turkey

Corresponding Author Email: nashwan.adnan@knu.edu.iq

https://doi.org/10.18280/ts.380515

ABSTRACT

Received: 13 September 2021

Accepted: 12 October 2021

An Unmanned Aerial Vehicle (UAV), commonly called a drone, is an aircraft without a human pilot aboard. UAVs that can accurately detect individuals on the ground are important for applications such as searching for people and surveillance. Integrating UAVs into smart cities is challenging, however, because of concerns such as privacy, safety, and ethical and legal use. UAV-based human action recognition can draw on modern technologies and is therefore essential for the future development of these applications. UAV-based human activity recognition is the task of assigning action labels to image sequences. This paper offers a comprehensive study of UAV-based human action recognition techniques. Furthermore, we conduct empirical studies to assess several factors that might influence the performance of human detection and action recognition techniques on UAVs. Benchmark datasets commonly used for UAV-based human action recognition are briefly described. Our findings reveal that existing human action recognition techniques can identify human actions from UAVs, with limitations related to range, altitude, long distances, and large angles of depression.

Keywords: human action recognition, human detection, unmanned aerial vehicle, image processing, smart city

1. INTRODUCTION

Unmanned aerial vehicles (UAVs) equipped with vision technology have become extremely common in recent years and are applied in a wide variety of areas. UAVs can be used for traffic management, civil security control, pollution monitoring, environmental monitoring, and merchandise delivery. UAVs are in essence flying robots that accomplish missions autonomously or under the remote control of a human operator. Current UAV technology permits operation in different regions while sending information to, and receiving commands from, a single protected ground station. Many of these technologies apply deep learning and computer vision methods, mainly to detect humans in the images captured by an onboard camera. UAVs can assist police officers in enforcing security and safety measures in smart cities. Combining UAVs with other technologies such as forensic mapping software, secure and reliable wireless communications, video streaming, and video-based abnormal human action recognition can help make smart cities safer places to live [1, 2].

Human action recognition (HAR) is an active and demanding field of machine learning and deep learning, with applications in security, healthcare, sports, and robotics. Furthermore, recognizing human actions can be used to detect falls in older people and to detect abnormal events. HAR plays an important role in human-to-human communication and interpersonal relations [3-5]. Moreover, it is a vigorous field of research that continues to develop because of its potential applications in various areas [6]. HAR is not easy to perform because, in addition to detecting and analyzing physical actions, it provides information about a person's identity, psychological state, and personality. The human ability to recognize another person's actions is one of the core topics of study in machine learning and computer vision. Numerous applications, including robotics for human behavior classification, video surveillance systems, and human-computer interaction, need multiple action recognition systems. The recognition of human actions, especially from videos captured by UAVs, has attracted the attention of numerous researchers. However, recognizing human activity from video sequences captured by drones remains a challenging problem because of many constraints related to the platform, such as changes in perspective, dynamic and complex backgrounds, human parallax, and camera height [7].

In recent years, cities worldwide have begun to upgrade their smart city infrastructure, which is only possible with the help of the latest technologies. Likewise, researchers from different fields have become increasingly interested in the concept of smart cities. Because so much information about the environment is available in smart cities, it is of interest to apply approaches that characterize the different domains and detect human behaviors and specific situations. Digital transformation has become a global demand of people who live in cities, and it improves citizens' quality of life. Smart cities improve people's living standards and make them feel safer by providing 24/7 security. The main goal of smart city design is to provide efficient infrastructure and services at reduced cost. UAV applications, among several others, can provide cost-effective services that help achieve the objectives of smart cities. Integrating UAVs with other technologies, such as unusual human action recognition, can create safer smart city environments [1]. In many areas, HAR is an effective way to monitor human actions in UAV video frames for smart cities and to identify the most unusual human actions. Furthermore, human action recognition can be used to steer a drone.

Deep learning is a commonly used technique in UAV-based HAR. It is an advanced and efficient branch of machine learning, inspired by biological neural networks, that addresses problems in natural language processing, bioinformatics, computer vision, and other fields. Deep learning allows everyday tasks to be automated. For instance, it can be used to detect objects in an image, classify text, and convert text to audio and vice versa [8, 9]. In the case of neural networks, a multi-layer perceptron (MLP) with more than two hidden layers can be regarded as a deep model. Commonly used layers are the convolution layer, fully connected layer, ReLU layer, pooling layer, and dropout layer. Deep learning is based on a set of algorithms that learn to represent the data; the most common are Deep Auto-Encoders, Convolutional Neural Networks (CNNs), Recurrent Neural Networks, and Deep Belief Networks [10, 11].

This paper aims to understand the limits of current HAR technologies implemented on UAVs and to offer possible guidelines for integrating HAR into UAV-based applications. UAVs may fly indoors or outdoors under any lighting or environmental conditions and may capture images from the air at any combination of angle of depression and altitude. In this survey, we carry out a set of empirical studies to examine the ability of some popular approaches to recognize specific human actions in images gathered by UAVs. The effects of the distance and the angle of depression from the UAV to the subjects are investigated to methodically examine the limits of existing HAR technologies when they are run on UAVs [1].

The rest of this paper is organized as follows. Section 2 presents a comparative study of UAV-based HAR methods. Commonly used UAV-based HAR benchmark datasets are presented in Section 3. Section 4 explains the challenges and limitations, together with suggested approaches. Finally, Section 5 concludes the paper and outlines the scope of future work.

2. COMPARATIVE UAV-BASED HAR METHODS

Human action recognition (HAR) is an active and demanding field of computer vision and deep learning, with applications in security, human fall detection, human-computer interaction, visual surveillance, healthcare, sports, and robotics. Furthermore, HAR is related to behavioral biometrics, which concerns approaches and algorithms for uniquely identifying a person from behavioral cues. In addition, combining UAVs with other technologies such as video-based abnormal movement detection, video streaming, and video-based unusual HAR can contribute to smart cities and safer living places. Recently, low-cost and lightweight devices have made UAVs a good candidate for the surveillance of human activities. UAV-based HAR methods play their part in finding the video segments that contain the activities of interest.

The general procedure of UAV-based HAR consists of three main stages. The first stage is the acquisition of the input frames with a UAV camera. In the second stage, human detection, humans are detected in the frames by trained machine learning or deep learning models. Finally, the HAR model is loaded to recognize the human actions. Figure 1 shows the general process of UAV-based HAR.

Figure 1. General procedure of UAV-based human action

recognition methods
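The three stages can be sketched in a few lines of Python. This is a minimal illustration only, not the implementation of any surveyed method: the names `recognize_actions`, `toy_detector`, and `toy_classifier` are hypothetical, frames are represented as plain 2-D lists, and the two stand-in models would be replaced by trained detection and action recognition models in practice.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]    # (x, y, width, height) in pixels
Frame = Sequence[Sequence[float]]  # a frame kept abstract as a 2-D grid
Patch = List[List[float]]


@dataclass
class ActionResult:
    box: Box
    action: str


def crop(frame: Frame, box: Box) -> Patch:
    """Cut the region covered by `box` out of the frame."""
    x, y, w, h = box
    return [list(row[x:x + w]) for row in frame[y:y + h]]


def recognize_actions(
    frames: Sequence[Frame],
    detect_humans: Callable[[Frame], List[Box]],
    classify_action: Callable[[Patch], str],
) -> List[List[ActionResult]]:
    """Run detection and action recognition on already acquired frames."""
    results = []
    for frame in frames:                  # stage 1 output: acquired frames
        per_frame = []
        for box in detect_humans(frame):  # stage 2: human detection
            label = classify_action(crop(frame, box))  # stage 3: recognition
            per_frame.append(ActionResult(box, label))
        results.append(per_frame)
    return results


def toy_detector(frame: Frame) -> List[Box]:
    # Hypothetical stand-in: one fixed box if any pixel is non-zero.
    return [(1, 1, 2, 2)] if any(any(row) for row in frame) else []


def toy_classifier(patch: Patch) -> str:
    # Hypothetical stand-in: bright patches are labelled "waving".
    mean = sum(map(sum, patch)) / (len(patch) * len(patch[0]))
    return "waving" if mean > 0.5 else "standing"


frames = [
    [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]],
    [[0] * 4 for _ in range(4)],
]
results = recognize_actions(frames, toy_detector, toy_classifier)
```

Passing the detector and classifier as callables keeps the pipeline independent of any particular model, which mirrors the way the surveyed methods differ mainly in the models used at stages two and three.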

This section discusses the main methods adapted for UAV-based HAR. Table 1 compares the different methodologies.

Recently, a UAV-based HAR framework was proposed by Mliki et al. [7]. Increasing attention has been paid to training the activity recognition model with multi-task learning. Their framework has two phases: an offline phase and an inference phase. The offline phase builds the human identification and human action models using a pre-trained CNN. The inference phase detects humans and recognizes their actions with the generated models. In their paper, scene stabilization was applied as preprocessing to determine the potential activity areas in the scene. Then, spatial features were extracted automatically to build a human action model. Both the extraction and the learning of these features were carried out with a pre-trained model. Mliki et al. used the GoogLeNet architecture because it offers a good compromise between computation time and classification error rate. They observed that GoogLeNet integrates nine Inception modules that comprise convolutions of various sizes, which permits learning features at various scales. In addition, they noted that the penultimate fully connected layer is replaced by a pooling layer in the GoogLeNet architecture. This technique reduces the size of the feature maps from (n×n×nc) to (1×1×nc), where nc is the number of channels of the input feature map. As a result, the overall number of parameters is reduced, which shortens the computation time. To adapt the GoogLeNet architecture to the action recognition task, they replaced the softmax layer of the pre-trained model with a new softmax layer. At the end of this stage, they obtained a CNN model that describes all the human actions. On the UCF-ARG dataset, the reported recognition rate of this approach is 56%.
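The effect of reducing an (n×n×nc) feature map to (1×1×nc) can be made concrete with a short sketch. It assumes that the pooling operation is a global average over the spatial positions, and the sizes used below (a 7×7 map with 1024 channels and 10 output classes) are illustrative assumptions rather than figures taken from Mliki et al.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [row][col][channel]


def global_average_pool(fmap: FeatureMap) -> List[float]:
    """Average each channel over all spatial positions: (n, n, nc) -> (nc,)."""
    n_rows, n_cols, n_ch = len(fmap), len(fmap[0]), len(fmap[0][0])
    totals = [0.0] * n_ch
    for row in fmap:
        for cell in row:
            for c, value in enumerate(cell):
                totals[c] += value
    return [t / (n_rows * n_cols) for t in totals]


def classifier_weights(n: int, nc: int, num_classes: int, pooled: bool) -> int:
    """Weight count of one dense layer on the pooled or the flattened features."""
    inputs = nc if pooled else n * n * nc
    return inputs * num_classes


# A tiny 2 x 2 map with 2 channels.
fmap = [[[1.0, 10.0], [3.0, 10.0]], [[5.0, 10.0], [7.0, 10.0]]]
pooled = global_average_pool(fmap)  # one value per channel

# Assumed sizes for illustration: n = 7, nc = 1024, 10 classes.
flat_weights = classifier_weights(7, 1024, 10, pooled=False)
pooled_weights = classifier_weights(7, 1024, 10, pooled=True)
```

With these assumed sizes, the dense layer that follows the pooled features needs n×n = 49 times fewer weights than one placed on the flattened map, which is the parameter reduction the paragraph above refers to.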

Sultani and Shah [12] proposed using game videos and aerial features generated by Generative Adversarial Networks (GANs) to improve UAV-based HAR when only a few real aerial samples are available. Their approach does not require the game and real activities to share the same labels. To deal with the different activity labels in the game and real datasets, they proposed a Disjoint Multitask Learning (DML) method to learn the activity classifiers effectively. Their experimental results and detailed evaluation demonstrated that game activity videos and GAN-generated examples can improve aerial recognition accuracy when combined appropriately.

Sultani and Shah [12] presented two new action datasets. The first is a game action dataset that comprises seven human activities, with 100 aerial-ground video pairs for each activity. The second is a real aerial dataset that includes eight activities of UCF-ARG. In their paper, DML was applied to game videos, GAN-generated aerial video features, and real aerial videos. The features of the few available real aerial videos and of the gameplay videos were computed with a 3D CNN, while the GAN-generated aerial features were produced with a GAN [13]. Two fully connected layers are shared among all tasks, and one further fully connected layer is used for each task. Furthermore, the researchers did not assume that the two datasets contain the same set of activities. They trained all four branches for classification, using softmax as the final activation function and cross-entropy loss. They showed that game and GAN-generated activity samples can help to learn a more accurate activity classifier with a DML framework.
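The layer sharing described above can be illustrated with a small forward-pass sketch. It is not the authors' implementation: the class name `DisjointMultitaskNet`, the layer sizes, the random weights, and the use of only two task heads (seven game classes and eight real aerial classes) are assumptions made for illustration, and no training step is shown.

```python
import math
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def dense(x: Vector, w: Matrix, relu: bool = True) -> Vector:
    """Fully connected layer without bias: one output per row of w."""
    out = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    return [max(0.0, v) for v in out] if relu else out


def softmax(z: Vector) -> Vector:
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class DisjointMultitaskNet:
    """Two shared dense layers followed by one task-specific dense layer."""

    def __init__(self, in_dim: int, hidden: int,
                 classes_per_task: Dict[str, int], seed: int = 0) -> None:
        rng = random.Random(seed)
        self.shared1 = random_matrix(hidden, in_dim, rng)
        self.shared2 = random_matrix(hidden, hidden, rng)
        self.heads = {task: random_matrix(k, hidden, rng)
                      for task, k in classes_per_task.items()}

    def predict(self, features: Vector, task: str) -> Vector:
        h = dense(dense(features, self.shared1), self.shared2)
        return softmax(dense(h, self.heads[task], relu=False))

    def loss(self, features: Vector, task: str, label: int) -> float:
        """Cross-entropy, computed only on the head of the sample's own task."""
        return -math.log(self.predict(features, task)[label])


net = DisjointMultitaskNet(in_dim=16, hidden=8,
                           classes_per_task={"game": 7, "real_aerial": 8})
features = [0.5] * 16  # stands in for a 3D CNN feature vector
p_game = net.predict(features, "game")
p_real = net.predict(features, "real_aerial")
game_loss = net.loss(features, "game", label=2)
```

Because each sample contributes a loss only through the head of its own task, the two label sets never need to match, while the shared layers still receive gradients from every task during training.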

Perera et al. [14] used an inexpensive hovering UAV to record 13 dynamic human activities. Their dataset consists of 240 HD videos with a total length of 44.6 minutes and 66,919 frames. The dataset was recorded at low altitude and low speed to capture as much human pose detail as possible at reasonably high resolution. To evaluate the dataset, they examined two well-known feature types used in HAR, namely Pose-based CNN (P-CNN) [15] and High-Level Pose Features (HLPF) [16]. P-CNN was used as the baseline activity recognition method. P-CNN uses CNN features of body parts that are extracted with the help of the estimated pose. Here, the CNN descriptors are built from person-centric motion and appearance features extracted using the body joint positions. For this task, they used the publicly available P-CNN code with slight modifications. HLPF recognizes activity classes from the temporal relations of body joints and their variations. HLPFs are formed by combining the temporal and spatial properties of body key points throughout the activity. They used the publicly available HLPF code with slight modifications. The HLPF was computed using 15 key points (head, neck, shoulders, elbows, wrists, abdomen, hips, knees, and ankles). The overall baseline activity recognition accuracy computed using P-CNN was 75.92%. Moreover, the baseline accuracy and the experimental details were compared with recently published human action datasets.
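To make the idea of combining spatial and temporal properties of key points more tangible, the sketch below computes a simplified pose descriptor. It is not the published HLPF code: the function names are hypothetical, and the choice of features (pairwise joint distances per frame plus their frame-to-frame change) is an assumption that only illustrates the principle.

```python
import math
from itertools import combinations
from typing import List, Tuple

Point = Tuple[float, float]
Pose = List[Point]  # one (x, y) per joint, same joint order in every frame


def pairwise_distances(pose: Pose) -> List[float]:
    """Spatial part: distance between every pair of joints in one frame."""
    return [math.dist(a, b) for a, b in combinations(pose, 2)]


def pose_features(sequence: List[Pose]) -> List[List[float]]:
    """For each frame transition: current distances, then their change."""
    spatial = [pairwise_distances(pose) for pose in sequence]
    features = []
    for prev, curr in zip(spatial, spatial[1:]):
        temporal = [c - p for p, c in zip(prev, curr)]
        features.append(curr + temporal)
    return features


# Three joints over two frames; only the middle joint moves.
sequence = [
    [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)],
    [(0.0, 0.0), (0.0, 4.0), (6.0, 8.0)],
]
features = pose_features(sequence)
```

With the 15 key points listed above there are 105 joint pairs, so this simplified descriptor would have 210 values per frame transition.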

Barekatain et al. [17] presented a model based on the Single Shot MultiBox Detector (SSD) [18] to detect objects and classify activities, and evaluated it on both tasks with their Okutama-Action dataset. SSD was used to detect pedestrians in the dataset. Then, the same model was used for action detection. The action detection model follows a two-stream approach, which can be divided into three phases. In the first phase, SSD is the object detector used to obtain the location and the class of activities as detection boxes. The second phase combines the detection and classification scores of each stream to merge the appearance and motion cues coming from the RGB and optical flow images. In the third phase, the detections are linked incrementally over time to build action tubes. They noticed that activities that are distinguished mainly by their temporal component have low accuracy. For example, walking is often confused with running, most likely because the classes are only differentiated at the frame level. Furthermore, pushing and carrying are classified more easily, which they attribute to the size and dimensions of the objects in the frames.
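The second phase, in which the scores of the two streams are combined, can be sketched as a simple late fusion. The weighted average used here, the function names, and the example scores are assumptions for illustration; the fusion rule actually used by Barekatain et al. is not specified in this survey.

```python
from typing import Dict

Scores = Dict[str, float]  # class name -> score for one detection box


def fuse_streams(appearance: Scores, motion: Scores,
                 motion_weight: float = 0.5) -> Scores:
    """Weighted average of the per-class scores of the two streams."""
    classes = appearance.keys() | motion.keys()
    return {
        c: (1.0 - motion_weight) * appearance.get(c, 0.0)
           + motion_weight * motion.get(c, 0.0)
        for c in classes
    }


def best_class(scores: Scores) -> str:
    return max(scores, key=scores.get)


# Hypothetical scores for one detection box.
rgb_scores = {"walking": 0.55, "running": 0.45}   # appearance stream
flow_scores = {"walking": 0.30, "running": 0.70}  # motion stream
fused = fuse_streams(rgb_scores, flow_scores)
```

In this made-up example the appearance stream alone favours walking, while the fused scores favour running, which shows how a motion cue can help separate classes that look alike in a single frame.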

Table 1. Comparison of different HAR methods

| Authors and year | Title | Activities | Algorithm | Dataset | Accuracy |
|---|---|---|---|---|---|
| Mliki et al. [7] | Human activity recognition from UAV-captured video sequences | Recognizes 10 different activities such as boxing, digging, and running | Convolutional Neural Network model (GoogLeNet architecture) | UCF-ARG dataset [21] | 56%. Low accuracy is obtained. |
| Sultani and Shah [12] | Human Action Recognition in Drone Videos using a Few Aerial Training Examples | Game action dataset; recognizes 7 different activities | Disjoint Multitask Learning (DML) for human activity recognition model generation and Wasserstein Generative Adversarial Networks (W-GAN) to produce aerial features from ground frames | 1) Aerial-Ground game dataset; 2) UCF-ARG; 3) GAN-generated aerial features; 4) YouTube-Aerial dataset | 64.5%. DML is limited by the requirement that various labels be available for every task for the same data. |
| Perera et al. [14] | Drone-Action: An Outdoor Recorded Drone Video Dataset for Action Recognition | Recognizes 13 dynamic human actions such as punching, kicking, walking, stabbing, jogging, and running | Pose-based Convolutional Neural Network (P-CNN) | Their own dataset (Drone-Action dataset), which comprises 240 HD videos consisting of 66,919 frames | 75.92%. Dataset gathered at low speed from low altitude. |
| Barekatain et al. [17] | Okutama-Action: An Aerial View Video Dataset for Concurrent Human Action Detection | Recognizes 12 human actions such as running, walking, and pushing | CNN (SSD model) | Their own dataset (Okutama-Action dataset), which comprises 43 minutes of sequences | 18.80%. The accuracy obtained is too low because of the high-resolution aerial view. |
| Liu and Szirányi [19] | Real-Time Human Detection and Gesture Recognition for On-Board UAV Rescue | Recognizes 10 different human actions such as stand, walk, and phone call | Deep Neural Network (DNN) model and OpenPose algorithm | Their own dataset | 99.80%. Very high accuracy obtained, but at a low altitude. |

Liu and Szirányi [19] proposed a real-time human detection and gesture recognition system for on-board UAV rescue. The drone detects a human from a longer distance at a resolution of 640 x 480. The system raises an alert and enters the recognition phase immediately after a person is detected. Their dataset consists of 10 actions recorded by a UAV camera, such as kicking, punching, standing, squatting, and sitting. The two most important dynamic gestures are the new Attention and Cancel gestures, which act as the set and reset functions, respectively, and with which users can establish communication with the drone. After the Cancel gesture is recognized, the system switches off automatically; after the Attention gesture is recognized, the user can establish further communication with the UAV. The system then enters the final hand gesture recognition phase to assist the user. When the body rescue gesture is recognized as Attention, the UAV gradually approaches the user so that the hand gestures can be recognized more reliably. The OpenPose [20] method is used to extract the user's skeleton and locate its joints. Liu and Szirányi built a Deep Neural Network (DNN) and used it to train and test the model. After 100 training epochs, the model reaches 99.79% accuracy on the training data and 99.80% accuracy on the test data. For the final hand gesture recognition phase, they used a dataset gathered online, labeled according to their own definitions, and trained a CNN on it to obtain the hand gesture recognition model. The UAV flies at a height of about three meters, diagonally above the user. However, there are some limitations and challenges when applying the system in the natural wilderness. Another restriction is the flight position of the UAV: the system requires the UAV to fly over a person at an angle, rather than vertically above the person's head, to sense body movements more accurately. Consequently, more time is needed to collect sufficient experimental data. Limited battery life is another constraint. The method can quickly retrain on new data to produce a new model in a short period for new rescue tasks.

3. COMMON DATASETS

Only a limited number of aerial datasets are readily available in the field of human activity recognition. Most datasets are limited to indoor scenes or to object tracking. In addition, many outdoor datasets do not contain enough detail of the human body to apply the latest deep learning techniques. Five of the most common aerial human action recognition datasets are the UCF-ARG (University of Central Florida Aerial camera, Rooftop camera, and Ground camera) dataset [21], the Games-Action dataset [12], the Okutama-Action dataset [17], the VIRAT dataset [22], and the Drone-Action dataset [14]. Table 2 summarizes these datasets for UAV-based human action recognition.

Table 2. Different types of UAV-based human behavior identification datasets

| Dataset | Stimuli | Number of actions | Types of actions | Resolution | Camera | Ref. |
|---|---|---|---|---|---|---|
| UCF-ARG dataset | 1440 video clips | 10 | running, clapping, carrying, digging, boxing, jogging, walking, throwing, waving, open-close trunk | 1920x1080 pixels (FHD) | A rooftop camera, an aerial camera, a ground camera | UCF Vision, CRCV (Center for Research in Computer Vision at the University of Central Florida), 2011 [21] |
| Games-Action dataset | 200 video clips | 7 | fighting, running, cycling, kicking a football, shooting, skydiving, walking | 720x480 pixels (SD) | Aerial gameplay video (FIFA game and GTA V game) | Waqas Sultani et al., Human Action Recognition in Drone Videos using a Few Aerial Training Examples, 2021 [12] |
| Okutama-Action dataset | 43 minutes of fully annotated sequences | 12 | handshaking, hugging, drinking, carrying, pushing, calling, reading, running, walking, lying, sitting, standing | 3840x2160 pixels (4K) | UAV camera | Barekatain et al., Okutama-Action: An Aerial View Video Dataset for Concurrent Human Action Detection, 2017 [17] |
| VIRAT dataset | 550 video clips | 16 | standing, crouching, sitting, walking, running, falling, gesturing, distress, aggressive, talking on phone, texting on phone, digging, using tool, throwing, kicking, umbrella | 720x480 pixels (SD) | Fixed and moving cameras | IARPA DIVA program, Viratdata / viratannotations, 2020 [22] |
| Drone-Action dataset | 240 video clips | 13 | walking front/back, walking side, punching, clapping, jogging side, hitting with bottle, hitting with stick, jogging front/back, kicking, running front/back, running side, stabbing, waving hands | 1920x1080 pixels (FHD) | UAV camera | Asanka G. Perera et al., Drone-Action: An Outdoor Recorded Drone Video Dataset for Action Recognition, 2019 [14] |

3.1 UCF-ARG dataset

The UCF-ARG dataset is a multi-view human action dataset. It contains ten human actions performed by twelve actors and recorded from a rooftop camera at a height of 100 feet, an aerial camera, and a ground camera. The actions include boxing, digging, running, and walking. Figure 2 shows a sample of the aerial UCF-ARG dataset. Every action is performed four times per actor in different directions. The open-close trunk action is performed only three times, on three cars parked in various orientations. The actions were recorded with an HD video camera at 1920 x 1080 resolution and 60 frames per second.

3.2 Games-Action dataset

The video games FIFA and GTA V (Grand Theft Auto V) were used to collect the game action dataset. The data are gathered while a player performs the same activity in the game, recorded from several viewpoints. FIFA and GTA V allow users to record activities from several viewpoints, with realistic-looking scenes and various realistic camera movements. Altogether, the two games provided a dataset with seven activities: fighting, running, cycling, kicking a football, shooting, skydiving, and walking. Because FIFA contains many football kicks, the kicks were collected from that game, while the other activities were collected from GTA V. Although Sultani and Shah use only the aerial gameplay videos in their current approach, they captured aerial and ground video pairs; that is, the same activity frames were recorded from both ground and aerial cameras. Figure 3 shows two frames per activity for both views: rows one, three, five, and seven show aerial videos, and rows two, four, six, and eight show ground videos. The dataset consists of 200 videos (100 aerial and 100 ground) for all actions.

Figure 2. Sample of the aerial UCF-ARG dataset [21]

Figure 3. Two frames per activity for ground and aerial

scenes from the game's action data set [12]

3.3 Okutama-Action dataset

The Okutama-Action dataset is an aerial-view video dataset for concurrent human activity detection. It comprises 43 minutes of video sequences recorded at 30 frames per second (FPS), with 77,365 frames in 4K resolution, and was introduced for detecting 12 human activities, including handshaking, drinking, carrying, and reading. The dataset was recorded with two drones flying at altitudes varying between 10 and 45 meters and with a camera angle of 45 or 90 degrees. Okutama-Action contains many challenges that are missing from existing datasets, including dynamic transitions between actions, significant changes in scale and aspect ratio, abrupt camera movements, and multi-labeled actors. This dataset is more challenging than other existing datasets and is expected to drive the field forward toward real-world applications. Up to nine actors perform different actions in sequence in each video, and multi-labeled actors pose a real challenge because one actor performs several actions simultaneously. All Okutama-Action videos were filmed from a UAV at a baseball stadium in Okutama, Japan. Figure 4 shows several samples from the Okutama-Action dataset. The video samples reflect everyday human activities. The Okutama-Action dataset groups the actions into three types. Figure 5 shows every activity class and its corresponding group.

Figure 4. Sample of the Okutama-Action dataset [17]

Figure 5. Labeling activity classes in Okutama-Action

dataset [17]

3.4 VIRAT dataset

VIRAT is a human action recognition dataset consisting of 550 video clips that cover a range of realistic and controlled human activities. The dataset was collected from moving and fixed cameras, and its two parts are named the VIRAT ground and aerial datasets. The VIRAT dataset is limited by its low resolution of 480 x 720 pixels, which restricts an algorithm's ability to retrieve rich action information from relatively small humans. Figure 6 shows samples from the public VIRAT dataset.

Figure 6. Sample of the VIRAT dataset [22]

3.5 Drone-Action dataset

The Drone-Action dataset is an HAR dataset consisting of 13 human activity classes recorded in FHD (1920 x 1080 resolution) at 25 FPS from a low altitude (8-12 m). The 13 activities were recorded while the UAV was in following and hovering modes. Figure 7 shows samples from the Drone-Action dataset.

Figure 7. Sample of the Drone-Action dataset [14]

Some of the Drone-Action activities, such as stabbing, kicking, and punching, were recorded while the UAV was hovering, while others, such as running, walking, and jogging, were recorded while the UAV followed the subject. Each video clip was recorded so as to keep as much of the body surface visible as possible. The dataset was designed to support situational awareness, case assessment, monitoring, search and rescue-related research, and activity recognition.
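The frame counts, frame rates, and durations quoted for the Drone-Action and Okutama-Action datasets can be cross-checked with a short calculation. The helper name `duration_minutes` is hypothetical; the numbers are the ones given in this survey.

```python
def duration_minutes(frames: int, fps: float) -> float:
    """Playback time in minutes for a given frame count and frame rate."""
    return frames / fps / 60.0


# Drone-Action: 66,919 frames at 25 FPS.
drone_action_minutes = duration_minutes(66_919, 25)

# Okutama-Action: 77,365 frames at 30 FPS.
okutama_minutes = duration_minutes(77_365, 30)
```

Both results agree with the durations stated in the text once rounded: roughly 44.6 minutes for Drone-Action and roughly 43 minutes for Okutama-Action.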

Finally, we note some rules and limitations that an autonomous drone must follow while gathering datasets, specifically in the field of HAR:

• Avoid flying at high speed, which causes motion blur.

• Avoid flying at very high altitudes, to preserve adequate frame resolution.

• Avoid flying at very low altitudes, as this poses a danger to humans and equipment.

• Record human subjects from a viewpoint that gives minimal perspective distortion.

• Hover to capture more detail in scenes of interest.

4. CHALLENGES AND LIMITATIONS

Limited work has been done on understanding complex human actions captured from a UAV. Some issues in UAV-based HAR remain open and merit further investigation. The distance between a UAV and its target directly affects the size of the human body in pixels. Because UAVs take aerial images, their altitude keeps them away from their ground targets. Altitude also creates an angle of depression from the UAV to its target, so the tilt angles of the human images gathered by UAVs can be large. Speed and flight position can also affect the quality of the human images and reduce HAR performance. This article mainly explores how distances, tilt angles, and other factors affect UAV-based HAR performance, because the effects of speed and flight position can be offset by appropriate settings of the aerial camera. Common reasons for the slow progress in recognizing human actions in aerial footage include the following:

• It is difficult to determine a human’s action accurately from frames taken by a UAV because of the variety of camera angles and altitudes.

• The performance of deep learning methods in determining human actions is lower than that of other classical methods.

• Using DNNs to recognize aerial actions automatically is problematic because deep-learning models are data-hungry and require hundreds of aerial human action videos for robust training, yet collecting large numbers of aerial videos of humans is difficult, time-consuming, and costly.

• Too few relevant video datasets exist to help algorithms recognize human actions in aerial footage. Some datasets supporting UAV-based HAR studies have appeared recently, but they are limited.

• A wide, crowded area is needed to collect the data and to test and analyze the results on real-time video.

• Recognizing human action is an inherently complex problem. Most action recognition studies focus on standard video datasets, usually ground-level videos. Applying the latest techniques remains challenging even with high-quality videos.

• Another limitation of UAVs is their limited battery life. Future efforts may include designing algorithms that run on low-power devices.

• Many UAV applications require an internet connection for online processing rather than offline processing. However, the limited resources of embedded platforms restrict the choice and complexity of activity recognition methods.

• Aerial video often suffers from occlusion, a lack of image detail, camera movement, and perspective distortion.

• Automatic recognition of human activities in UAV frames is daunting because the UAV camera movement is difficult to control and the actors are small.

• The relevant details of the UAV and the human activities vary with the field of application. For instance, the primary concern of a monitoring system is often to detect unusual behavior, such as jumping over a fence or falling.

• System performance depends on how distinct the action classes are. For instance, walking and running differ only to a small degree. A good HAR system must be capable of distinguishing the actions of one class from those of another.

• In background modeling, the main problems include gradual changes in illumination conditions, small movements of unsteady objects such as tree branches and shrubs in the wind, noise due to a poor image source, endless variations of the objects in the scene, multiple moving objects over long and short periods, lighting contrast, dynamic backgrounds, internal scale contrast, blur, and shadows.

• HAR becomes challenging under changes in style, viewpoint, human appearance, and clothing. Distinguishing between similar activities and handling human-object interaction remain open research areas.

• Tracking multiple objects is complex, and identifying anomalies, such as fraud and abnormal crowd behavior, from an inadequate amount of training data is problematic.
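The effect of distance and depression angle described at the start of this section can be illustrated with simple pinhole-camera geometry. The sketch below is a simplified approximation under stated assumptions: an upright person on flat ground, a depression angle measured between the horizontal and the line of sight, and a focal length expressed in pixels. The function name and default person height are hypothetical choices for illustration.

```python
import math


def apparent_height_px(altitude_m, depression_deg, focal_length_px, person_height_m=1.7):
    """Approximate on-image height of an upright person seen from a UAV.

    The slant distance grows with altitude, and the visible (foreshortened)
    body height shrinks as the view approaches straight down.
    """
    if not 0.0 < depression_deg <= 90.0:
        raise ValueError("depression_deg must be in (0, 90]")
    theta = math.radians(depression_deg)
    slant_distance_m = altitude_m / math.sin(theta)
    # Component of the body height perpendicular to the line of sight.
    projected_height_m = person_height_m * math.cos(theta)
    return focal_length_px * projected_height_m / slant_distance_m


if __name__ == "__main__":
    for altitude in (10.0, 30.0, 60.0):
        for angle in (30.0, 60.0, 85.0):
            px = apparent_height_px(altitude, angle, focal_length_px=1000.0)
            print(f"altitude={altitude:4.0f} m  depression={angle:4.1f} deg  height={px:6.1f} px")
```

The approximation ignores the shoulder-width footprint that remains visible from directly overhead. It still shows why both long distances and large depression angles leave few pixels for the detector and the action classifier to work with.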

Based on the above, recognizing human activity utilizing

aerial video frames is less familiar and less studied than

recognizing general human activities. Artificial intelligence

researchers have sought to explore human actions in various

types of video frames, including game videos, sports videos,

and surveillance. However, insufficient research has been

done to recognize human actions in UAV video frames,

despite this field being very helpful and of practical

importance.

To achieve an accurate HAR system for UAVs, we recommend the following criteria:

• An extensive dataset is expected to yield more accurate recognition of human actions.

• Different points of view can contribute to the success of CNN architectures such as MobileNet, Inception, VGG, and ResNet in determining human actions.

• Frames should be captured with a high-quality, wide-angle camera so that video frames can be analyzed from above.

• Developers and researchers have found embedded platforms such as Raspberry Pi and NVIDIA Jetson well suited to running artificial intelligence applications on UAVs.

• CNN architectures such as MobileNet were developed to address performance problems in embedded vision applications, mobile devices, and UAVs.
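MobileNet suits embedded platforms because it is built on depthwise separable convolutions, which need far fewer weights than standard convolutions. The sketch below compares the weight counts of a single layer. Biases are ignored, and the layer sizes are arbitrary example values rather than figures from any surveyed model.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = kernel * kernel * c_in  # one kernel x kernel filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    k, c_in, c_out = 3, 128, 256  # example layer sizes, chosen for illustration
    std = standard_conv_params(k, c_in, c_out)
    sep = depthwise_separable_params(k, c_in, c_out)
    print(f"standard: {std} weights")
    print(f"depthwise separable: {sep} weights")
    print(f"reduction factor: {std / sep:.1f}x")
```

For a 3x3 kernel the reduction approaches a factor of nine as the number of output channels grows, which is why such layers suit memory- and power-constrained UAV hardware.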

Lastly, our analysis reveals that the most critical factors affecting the performance of UAV-based HAR are the angle of view, flight altitude, inadequate datasets, long distances, and UAV camera movement. Because of these factors, existing UAV-based human action recognition techniques are limited in accuracy. Table 3 lists recommendations for reducing the impact of each factor and thereby improving UAV-based HAR performance to a satisfactory accuracy.

In short, by applying the recommended solutions above and using fewer parameters when training deep learning models, we can improve the accuracy and performance of a UAV-based HAR system. Performance can be improved further by using powerful deep learning techniques, collecting more data and integrating it with the available datasets, and investing in a UAV with a 4K camera and long battery life.

Table 3. Recommended solutions to reduce the impact of the most common factors

| Factors | Recommended solutions |
|---|---|
| Variety of camera angles and altitude | The impact of these factors can be reduced by using a wide-angle camera of the kind that has recently become available, such as a UAV with a 180-degree wide-angle camera |
| Performance of deep learning methods in recognizing human actions | The impact of this factor can be reduced by using powerful embedded platforms that can run on a GPU and by training a model such as the MobileNet architectures |
| Deep-learning models require hundreds of aerial human action training videos, and collecting large amounts of action data is difficult | The impact of this factor can be reduced by extracting action frames from games, by creating a new dataset, or by combining all datasets in the field |
| Lack of sufficient datasets to support UAV-based HAR studies | The impact of this factor can be reduced by collecting datasets with the help of recently released efficient UAVs |
| Aerial video quality, limited battery life, and small-sized actors when the UAV flies at a high altitude; illumination conditions in the scene, endless variations, flying in wind, noise, lighting contrast, dynamic background, contrast, blur, and shadows | The impact of these factors can be reduced by using a recently available UAV, which may offer long battery life and a 4K camera that improves the quality of the aerial video. In addition, techniques that extract the region of interest (ROI) can reduce dynamic-background issues |
| UAV camera movements | The latest video stabilization techniques can be used to reduce camera movements |
| Network problem | The latest embedded platforms, such as NVIDIA Jetson, can easily solve network problems |
| Significant differences in the action class | An HAR system can learn to distinguish the actions of one class from another when more data is collected |
| Change in style, view invariance, human change, and changes in clothing; the complexity of tracking multiple objects; and identifying anomalies and abnormal crowd behaviour | The impact of these factors can also be reduced by collecting more data |
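The region of interest (ROI) extraction recommended in Table 3 can be sketched as a padded crop around a detected person. This is a minimal illustration, not a method from the surveyed papers. The bounding box is assumed to come from some upstream person detector, frames are represented as plain nested lists so the example has no dependencies, and the helper names and padding fraction are hypothetical.

```python
def padded_roi(box, frame_w, frame_h, pad=0.2):
    """Expand an (x, y, w, h) box by a padding fraction and clip it to the frame.

    Returns (x0, y0, x1, y1) with exclusive upper bounds.
    """
    x, y, w, h = box
    dx, dy = int(w * pad), int(h * pad)
    x0 = max(0, x - dx)
    y0 = max(0, y - dy)
    x1 = min(frame_w, x + w + dx)
    y1 = min(frame_h, y + h + dy)
    return x0, y0, x1, y1


def crop(frame, roi):
    """Cut the ROI out of a frame stored as a list of pixel rows."""
    x0, y0, x1, y1 = roi
    return [row[x0:x1] for row in frame[y0:y1]]


if __name__ == "__main__":
    # Toy 6x8 frame whose pixel values encode their own position.
    frame = [[r * 8 + c for c in range(8)] for r in range(6)]
    roi = padded_roi((3, 2, 2, 2), frame_w=8, frame_h=6, pad=0.5)
    for row in crop(frame, roi):
        print(row)
```

Passing only the cropped region to the action classifier discards most of the moving background, at the cost of depending on the accuracy of the detector that supplies the box.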

5. CONCLUSIONS

UAV-based recognition of human actions is an active area of research, and the technology has come a long way over the past two decades. This paper discusses the techniques and limitations of UAV-based human activity recognition in depth. The survey covered recently published research on various UAV-based HAR technologies for aerial images and video frames. The main objective was to provide a comprehensive survey and comparison of UAV-based HAR methods. Public datasets aimed at evaluating approaches from multiple perspectives were also briefly explained, and some difficulties and limitations were highlighted. In summary, the literature shows that HAR systems still suffer from limitations; for example, some activities have low recognition rates. More research is required to improve accuracy and increase the number of actions a system can detect. In the coming years, we expect UAV-based HAR to become a strong option as high-performance computing hardware makes it possible to process large amounts of data in less time using a vision-based approach. In this paper, the effects of factors such as distance, angle of depression, and altitude on HAR performance in UAVs were investigated.

From the empirical studies in the literature, we concluded that HAR techniques can perform adequately on UAVs. However, for these technologies to reach their full potential, some obstacles must be addressed. The small human images captured by UAVs from long distances pose a serious challenge for both human detection and action classification. Differences in posture introduced by large depression angles also significantly weaken human detection and action recognition accuracy. On the other hand, a recognition model enriched with 3D modeling techniques can improve UAV-based HAR performance at large depression angles, but this enrichment may also reduce the ability to distinguish between humans in standard conditions and therefore requires further research.

Several performance problems still need to be resolved for real-time deployment, such as changes in appearance, high computational cost, camera view changes, lighting, and low classification rates. In addition, only a limited number of aerial footage datasets are available in the field of HAR. Most datasets are limited to indoor scenes or object tracking, and many outdoor datasets do not contain enough detail about human actions to apply the latest machine learning and deep learning methods. To fill this gap and enable research in broader application areas, we plan to create a new outdoor dataset that includes most everyday human actions, especially abnormal ones. As future work, we would also like to train powerful deep learning models based on the MobileNet architecture that can handle multi-label output for describing multiple actions. More studies are also needed on how aerial camera parameters such as resolution and compression ratio affect HAR performance in UAVs, as the size of humans in the image greatly influences HAR performance. In addition, wide field of view (FOV) cameras not only capture wide scenes but also introduce distortions at the edges of the images, and compensating for the adverse effects of these distortions is worth investigating. Constraints on the network bandwidth, battery, and computing power of the embedded system carried by the UAV limit how HAR can be performed in this scenario. Developing a UAV-based system that recognizes human actions while balancing accuracy, computation, network transmission, and energy consumption will be part of our future work.
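The multi-label output mentioned above differs from ordinary single-label classification in that each action receives its own independent probability, so one person can be assigned several actions at once. The sketch below shows one common way to decode such output, using a per-class sigmoid and a threshold. The label names, logit values, and threshold are hypothetical and are not taken from any surveyed model.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def decode_multi_label(logits, labels, threshold=0.5):
    """Return every label whose independent probability reaches the threshold."""
    return [name for name, z in zip(labels, logits) if sigmoid(z) >= threshold]


if __name__ == "__main__":
    labels = ["walking", "carrying", "calling"]  # hypothetical action classes
    logits = [2.0, -1.0, 0.3]                    # hypothetical network outputs
    print(decode_multi_label(logits, labels))
    print(decode_multi_label(logits, labels, threshold=0.6))
```

With a softmax output the classes would compete and only one could win. Independent sigmoids let concurrent actions be reported together, and the threshold becomes a tunable trade-off between missed and spurious actions.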


Development of a Novel Lightweight CNN Model for Classification of Human Actions in UAV-Captured Videos

Article

Full-text available

Feb 2023

There has been increased attention paid to autonomous unmanned aerial vehicles (UAVs) recently because of their usage in several fields. Human action recognition (HAR) in UAV videos plays an important role in various real-life applications. Although HAR using UAV frames has not received much attention from researchers to date, it is still a significant area that needs further study because of its relevance for the development of efficient algorithms for autonomous drone surveillance. Current deep-learning models for HAR have limitations, such as large weight parameters and slow inference speeds, which make them unsuitable for practical applications that require fast and accurate detection of unusual human actions. In response to this problem, this paper presents a new deep-learning model based on depthwise separable convolutions that has been designed to be lightweight. Other parts of the HarNet model comprised convolutional, rectified linear unit, dropout, pooling, padding, and dense blocks. The effectiveness of the model has been tested using the publicly available UCF-ARG dataset. The proposed model, called HarNet, has enhanced the rate of successful classification. Each unit of frame data was pre-processed one by one by different computer vision methods before it was incorporated into the HarNet model. The proposed model, which has a compact architecture with just 2.2 million parameters, obtained a 96.15% success rate in classification, outperforming the MobileNet, Xception, DenseNet201, Inception-ResNetV2, VGG-16, and VGG-19 models on the same dataset. The proposed model had numerous key advantages, including low complexity, a small number of parameters, and high classification performance. The outcomes of this paper showed that the model’s performance was superior to that of other models that used the UCF-ARG dataset.

VT-BPAN: vision transformer-based bilinear pooling and attention network fusion of RGB and skeleton features for human action recognition

Article

Full-text available

Dec 2023
MULTIMED TOOLS APPL

Recent generation Microsoft Kinect Camera captures a series of multimodal signals that provide RGB video, depth sequences, and skeleton information, thus it becomes an option to achieve enhanced human action recognition performance by fusing different data modalities. However, most existing fusion methods simply fuse different features, which ignores the underlying semantics between different models, leading to a lack of accuracy. In addition, there exists a large amount of background noise. In this work, we propose a Vision Transformer-based Bilinear Pooling and Attention Network (VT-BPAN) fusion mechanism for human action recognition. This work improves the recognition accuracy in the following ways: 1) An effective two-stream feature pooling and fusion mechanism is proposed. The RGB frames and skeleton are fused to enhance the spatio-temporal feature representation. 2) A spatial lightweight multiscale vision Transformer is proposed, which can reduce the cost of computing. The framework is evaluated based on three widely used video action datasets, and the proposed approach performs a more comparable performance with the state-of-the-art methods.

Ground-to-Aerial Person Search: Benchmark Dataset and Approach

Preprint

Aug 2023

In this work, we construct a large-scale dataset for Ground-to-Aerial Person Search, named G2APS, which contains 31,770 images of 260,559 annotated bounding boxes for 2,644 identities appearing in both of the UAVs and ground surveillance cameras. To our knowledge, this is the first dataset for cross-platform intelligent surveillance applications, where the UAVs could work as a powerful complement for the ground surveillance cameras. To more realistically simulate the actual cross-platform Ground-to-Aerial surveillance scenarios, the surveillance cameras are fixed about 2 meters above the ground, while the UAVs capture videos of persons at different location, with a variety of view-angles, flight attitudes and flight modes. Therefore, the dataset has the following unique characteristics: 1) drastic view-angle changes between query and gallery person images from cross-platform cameras; 2) diverse resolutions, poses and views of the person images under 9 rich real-world scenarios. On basis of the G2APS benchmark dataset, we demonstrate detailed analysis about current two-step and end-to-end person search methods, and further propose a simple yet effective knowledge distillation scheme on the head of the ReID network, which achieves state-of-the-art performances on both of the G2APS and the previous two public person search datasets, i.e., PRW and CUHK-SYSU. The dataset and source code available on \url{https://github.com/yqc123456/HKD_for_person_search}.

Camshift Algorithm with GOA-Neural Network for Drone Object Tracking

Article

Full-text available

Apr 2023

A New UAV-Based Social Distance Detector for COVID-19 Outbreaks Reduction, Using IoT, Computer Vision and Deep Learning Technologies

Article

Dec 2022

Nowadays, we are living in a dangerous environment and our health system is under the threatened causes of Covid19 and other diseases. The people who are close together are more threatened by different viruses, especially Covid19. In addition, limiting the physical distance between people helps minimize the risk of the virus spreading. For this reason, we created a smart system to detect violated social distance in public areas as markets and streets. In the proposed system, the algorithm for people detection uses a pre-existing deep learning model and computer vision techniques to determine the distances between humans. The detection model uses bounding box information to identify persons. The identified bounding box centroid's pairwise distances of people are calculated using the Euclidean distance. Also, we used jetson nano platform to implement a low-cost embedded system and IoT techniques to send the images and notifications to the nearest police station to apply forfeit when it detects people’s congestion in a specific area. Lastly, the suggested system has the capability to assist decrease the intensity of the spread of COVID-19 and other diseases by identifying violated social distance measures and notifying the owner of the system. Using the transformation matrix and accurate pedestrian detection, the process of detecting social distances between individuals may be achieved great confidence. Experiments show that CNN-based object detectors with our suggested social distancing algorithm provide reasonable accuracy for monitoring social distancing in public places, as well.

A Multimodal Information Fusion Model for Robot Action Recognition with Time Series

Article

Full-text available

Jun 2022

The current robotics field, led by a new generation of information technology, is moving into a new stage of human-machine collaborative operation. Unlike traditional robots that need to use isolation rails to maintain a certain safety distance from people, the new generation of human-machine collaboration systems can work side by side with humans without spatial obstruction, giving full play to the expertise of people and machines through an intelligent assignment of operational tasks and improving work patterns to achieve increased efficiency. The robot’s efficient and accurate recognition of human movements has become a key factor in measuring robot performance. Usually, the data for action recognition is video data, and video data is time-series data. Time series describe the response results of a certain system at different times. Therefore, the study of time series can be used to recognize the structural characteristics of the system and reveal its operation law. As a result, this paper proposes a time series-based action recognition model with multimodal information fusion and applies it to a robot to realize friendly human-robot interaction. Multifeatures can characterize data information comprehensively, and in this study, the spatial flow and motion flow features of the dataset are extracted separately, and each feature is input into a bidirectional long and short-term memory network (BiLSTM). A confidence fusion method was used to obtain the final action recognition results. Experiment results on the publicly available datasets NTU-RGB + D and MSR Action 3D show that the method proposed in this paper can improve action recognition accuracy.

A New Efficient-Attention Based Disaster Classification for Emergency Monitoring

Conference Paper

Feb 2024

Ground-to-Aerial Person Search: Benchmark Dataset and Approach

Conference Paper

Oct 2023

Real-Time Human Detection and Gesture Recognition for On-Board UAV Rescue

Article

Full-text available

Mar 2021
SENSORS-BASEL

Unmanned aerial vehicles (UAVs) play an important role in numerous technical and scientific fields, especially in wilderness rescue. This paper carries out work on real-time UAV human detection and recognition of body and hand rescue gestures. We use body-featuring solutions to establish biometric communications, like yolo3-tiny for human detection. When the presence of a person is detected, the system will enter the gesture recognition phase, where the user and the drone can communicate briefly and effectively, avoiding the drawbacks of speech communication. A data-set of ten body rescue gestures (i.e., Kick, Punch, Squat, Stand, Attention, Cancel, Walk, Sit, Direction, and PhoneCall) has been created by a UAV on-board camera. The two most important gestures are the novel dynamic Attention and Cancel which represent the set and reset functions respectively. When the rescue gesture of the human body is recognized as Attention, the drone will gradually approach the user with a larger resolution for hand gesture recognition. The system achieves 99.80% accuracy on testing data in body gesture data-set and 94.71% accuracy on testing data in hand gesture data-set by using the deep learning method. Experiments conducted on real-time UAV cameras confirm our solution can achieve our expected UAV rescue purpose.

A New Embedded Surveillance System for Reducing COVID-19 Outbreak in Elderly Based on Deep Learning and IoT

Conference Paper

Full-text available

Oct 2020

Deep learning and handcrafted features for one-class anomaly detection in UAV video

Article

Full-text available

Jan 2021
MULTIMED TOOLS APPL

Visual surveillance systems have recently captured the attention of the research community. Most of the proposed surveillance systems deal with stationary cameras. Nevertheless, these systems may reflect minor applicability in anomaly detection when multiple cameras are required. Lately, under technological progress in electronic and avionics systems, Unmanned Aerial Vehicles (UAVs) are increasingly used in a wide variety of urban missions. Especially, in the surveillance context, UAVs can be used as mobile cameras to overcome weaknesses of stationary cameras. One of the principal advantages that makes UAVs attractive is their ability to provide a new aerial perspective. Despite their numerous advantages, there are many difficulties associated with automatic anomalies detection by an UAV, as there is a lack in the proposed contributions describing anomaly detection in videos recorded by a drone. In this paper, we propose new anomaly detection techniques for assisting UAV based surveillance mission where videos are acquired by a mobile camera. To extract robust features from UAV videos, three different features extraction methods were used, namely a pretrained Convolutional Neural Network (CNN) and two popular handcrafted methods (Histogram of Oriented Gradient (HOG) and HOG3D). One Class Support Vector Machine (OCSVM) has been then applied for the unsupervised classification. Extensive experiments carried on a dataset containing videos taken by an UAV monitoring a car parking, prove the efficiency of the proposed techniques. Specifically, the quantitative results obtained using the challenging Area Under Curve (AUC) evaluation metric show that, despite the variation among them, the proposed methods achieve good results in comparison to the existing technique with an AUC = 0.78 at worst and an AUC = 0.93 at best.

Drone-Action: An Outdoor Recorded Drone Video Dataset for Action Recognition

Article

Full-text available

Nov 2019

Aerial human action recognition is an emerging topic in drone applications. Commercial drone platforms capable of detecting basic human actions such as hand gestures have been developed. However, a limited number of aerial video datasets are available to support increased research into aerial human action analysis. Most of the datasets are confined to indoor scenes or object tracking and many outdoor datasets do not have sufficient human body details to apply state-of-the-art machine learning techniques. To fill this gap and enable research in wider application areas, we present an action recognition dataset recorded in an outdoor setting. A free flying drone was used to record 13 dynamic human actions. The dataset contains 240 high-definition video clips consisting of 66,919 frames. All of the videos were recorded from low-altitude and at low speed to capture the maximum human pose details with relatively high resolution. This dataset should be useful to many research areas, including action recognition, surveillance, situational awareness, and gait analysis. To test the dataset, we evaluated the dataset with a pose-based convolutional neural network (P-CNN) and high-level pose feature (HLPF) descriptors. The overall baseline action recognition accuracy calculated using P-CNN was 75.92%.

Action recognition in freestyle wrestling using silhouette-skeleton features

Article

Full-text available

Nov 2019

Despite many advances made in Human Action Recognition (HAR), there are still challenges encouraging researchers to explore new methods. In this study, a new feature descriptor based on the silhouette skeleton called Histogram of Graph Nodes (HGN) is proposed. Unlike similar methods, which are strictly based on the articulated human body model, we extracted discriminative features solely using the foreground silhouettes. To this purpose, first, the skeletons of the silhouettes are converted into a graph, representing approximately articulated human body skeleton. By partitioning the region of the graph, the HGN is calculated in each frame. After that, we obtain the final feature vector by combining the HGNs in time. On the other hand, the recognition of two-person sports techniques is one of the areas that has not received adequate attention. To this end, we investigate the recognition of techniques in wrestling as a new computer vision application. In this regard, a dataset of the Freestyle Wrestling techniques (FSW) is introduced. We conducted extensive experiments using the proposed method on the provided dataset. In addition, we examined the proposed feature descriptor on the SBU and THETIS datasets, and the MHI-based features on the FSW dataset. We achieved 84.9% accuracy on FSW dataset while the results are 90.8% for SBU and 44% for THETIS datasets. The fact that experimental results are superior or comparable to other similar methods indicates the effectiveness of the proposed approach.

A Smart School by Using an Embedded Deep Learning Approach for Preventing Fake Attendance

Conference Paper

Full-text available

Sep 2019

Human action recognition in drone videos using a few aerial training examples

Article

Feb 2021

Drones are enabling new forms of human actions surveillance due to their low cost and fast mobility. However, using deep neural networks for automatic aerial action recognition is difficult due to the need for a large number of training aerial human action videos. Collecting a large number of human action aerial videos is costly, time-consuming, and difficult. In this paper, we explore two alternative data sources to improve aerial action classification when only a few training aerial examples are available. As a first data source, we resort to video games. We collect plenty of aerial game action videos using two gaming engines. For the second data source, we leverage conditional Wasserstein Generative Adversarial Networks to generate aerial features from ground videos. Given that both data sources have some limitations, e.g. game videos are biased towards specific actions categories (fighting, shooting, etc.,), and it is not easy to generate good discriminative GAN-generated features for all types of actions, we need to efficiently integrate two dataset sources with few available real aerial training videos. To address this challenge of the heterogeneous nature of the data, we propose to use a disjoint multitask learning framework. We feed the network with real and game, or real and GAN-generated data in an alternating fashion to obtain an improved action classifier. We validate the proposed approach on two aerial action datasets and demonstrate that features from aerial game videos and those generated from GAN can be extremely useful for an improved action recognition in real aerial videos when only a few real aerial training examples are available.

Security Analysis of Drones Systems: Attacks, Limitations, and Recommendations

Article

May 2020

Recently, the world has witnessed a significant increase in the number of drones in use, with a global and continuous rise in the demand for their multi-purpose applications. The pervasiveness of these drones is due to their ability to answer people’s needs. Drones provide users with a bird’s-eye view that can be activated and used almost anywhere and at any time. However, the malicious use of drones has recently begun to emerge among criminals and cyber-criminals alike. The probability and frequency of these attacks are both high, and their impact can be very dangerous, with devastating effects. Therefore, detective, protective, and preventive countermeasures are needed. The aim of this survey is to investigate the emerging threats of using drones in cyber-attacks, along with the countermeasures to thwart these attacks. The different uses of drones for malicious purposes are also reviewed, along with the possible detection methods. As such, this paper analyzes the exploitation of drone vulnerabilities within communication links, as well as smart devices and hardware, including smartphones and tablets. Moreover, this paper presents a detailed review of drone/Unmanned Aerial Vehicle (UAV) usage in multiple domains (e.g., civilian, military, terrorism) and for different purposes. A realistic attack scenario is also presented, which details how the authors performed a simulated attack on a given drone following the hacking cycle. This review would greatly help ethical hackers to understand the existing vulnerabilities of UAVs in both military and civilian domains. Moreover, it allows them to adopt and devise new techniques and technologies for enhanced UAV attack detection and protection. As a result, various civilian and military anti-drone/UAV (detective and preventive) countermeasures are reviewed.

Human activity recognition from UAV-captured video sequences

Article

Nov 2019
PATTERN RECOGN

This research paper introduces a new approach for human activity recognition from UAV-captured video sequences. The proposed approach involves two phases: an offline phase and an inference phase. A scene stabilization step is performed together with these two phases. The offline phase generates a human/non-human model as well as a human activity model using a convolutional neural network. The inference phase uses these previously generated models to detect humans and recognize their activities. Our main contribution lies in adapting convolutional neural networks, normally dedicated to the classification task, to detect humans. In addition, the classification of human activities is carried out according to two scenarios: an instant classification of video frames and a classification of entire video sequences. Through an experimental evaluation of the proposed methods for human detection and human activity classification on the UCF-ARG dataset, we validated not only these contributions but also the performance of our methods compared to existing ones.

OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields

Article

Jul 2019

Realtime multi-person 2D pose estimation is a key component in enabling machines to have an understanding of people in images and videos. In this work, we present a realtime approach to detect the 2D pose of multiple people in an image. The proposed method uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. This bottom-up system achieves high accuracy and realtime performance, regardless of the number of people in the image. In previous work, PAFs and body part location estimation were refined simultaneously across training stages. We demonstrate that using a PAF-only refinement is able to achieve a substantial increase in both runtime performance and accuracy. We also present the first combined body and foot keypoint detector, based on an annotated foot dataset that we have publicly released. We show that the combined detector not only reduces the inference time compared to running them sequentially, but also maintains the accuracy of each component individually. This work has culminated in the release of OpenPose, the first open-source realtime system for multi-person 2D pose detection, including body, foot, hand, and facial keypoints.
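How a part affinity field can help associate two detected body parts may be sketched as follows. The sketch scores a candidate limb by averaging the alignment between the field and the limb direction at points sampled along the segment. The function name `paf_score`, the `field` callable (a stand-in for the network's predicted field), and the sample count are assumptions for illustration, not the OpenPose API.

```python
import math
from typing import Callable, Tuple

Vec = Tuple[float, float]


def paf_score(p1: Vec, p2: Vec, field: Callable[[float, float], Vec],
              samples: int = 10) -> float:
    """Average dot product of the field with the unit vector from p1 to p2.

    `field(x, y)` is a hypothetical stand-in that returns the 2D affinity
    vector at an image location. Higher scores mean the field supports
    connecting p1 to p2 as one limb.
    """
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0  # coincident points carry no direction
    ux, uy = dx / length, dy / length
    total = 0.0
    for i in range(samples):
        # Sample evenly from p1 to p2, endpoints included.
        t = i / (samples - 1) if samples > 1 else 0.5
        fx, fy = field(p1[0] + t * dx, p1[1] + t * dy)
        total += fx * ux + fy * uy
    return total / samples


def uniform_field(x: float, y: float) -> Vec:
    """Toy field pointing along +x everywhere."""
    return (1.0, 0.0)


aligned = paf_score((0.0, 0.0), (4.0, 0.0), uniform_field)
perpendicular = paf_score((0.0, 0.0), (0.0, 4.0), uniform_field)
```

In this toy field a candidate limb along the x-axis scores highest, while a perpendicular one scores zero, which is the property a matcher would use to prefer one pairing over another.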
