Arabian Journal for Science and Engineering
https://doi.org/10.1007/s13369-022-07092-x
RESEARCH ARTICLE-COMPUTER ENGINEERING AND COMPUTER SCIENCE
Semantic Segmentation Based Crowd Tracking and Anomaly Detection via Neuro-fuzzy Classifier in Smart Surveillance System
Faisal Abdullah¹ · Ahmad Jalal¹

✉ Ahmad Jalal, ahmjal@yahoo.com
Faisal Abdullah, 191633@students.au.edu.pk
¹ Department of Computer Science, Air University, E-9, Islamabad 44000, Pakistan
Received: 25 October 2021 / Accepted: 22 June 2022
© King Fahd University of Petroleum & Minerals 2022
Abstract
Crowd tracking and the analysis of crowd behavior is a challenging research area in computer vision. In today's crowded environments, manual surveillance systems are inefficient, labor-intensive, and unwieldy. Automated video surveillance systems offer promising solutions to these problems and have hence become a necessity. However, challenges remain. The most important challenge is the extraction of a foreground representing human pixels only; the extraction of robust spatial and temporal descriptors, together with a potent classifier, is also essential for accurate behavior detection. In this paper, we present our approach to these challenges by introducing semantic segmentation for foreground extraction. Furthermore, for pedestrian counting and tracking, we introduce a fusion of human motion analysis and an attraction force model via a weighted averaging method that removes non-humans and non-pedestrians from the scene. The verified pedestrians are counted using a fuzzy-c-means algorithm and tracked via Hungarian algorithm association along with a dynamic template matching technique. For anomaly detection, after silhouette extraction we introduce robust spatio-temporal descriptors, including crowd shape deformation, silhouette slicing, particle convection, dominant motion, and energy descriptors, which we optimize using an adaptive genetic algorithm; finally, the multi-fused optimal features are fed to a multilayer neuro-fuzzy classifier for decision making. The proposed system is validated via extensive experimentation and achieves accuracies of 91.8% and 89.16% over the UCSD and Mall datasets for crowd tracking. The mean absolute error and mean square error for pedestrian counting are 1.69 and 2.09 over the UCSD dataset and 2.57 and 4.34 for the Mall dataset, respectively. Accuracies of 96.5% and 94% are achieved over the UMN and MED datasets for anomaly detection.
Keywords Attraction force model · Crowd shape deformation · Multilayer neuro-fuzzy classifier · Semantic segmentation · Time-domain descriptors · Tracking and anomaly detection
1 Introduction
Automatic video surveillance is regarded as the first step in various artificial intelligence applications [1, 2] developed to track human crowds and to analyze crowd behavior [3, 4]. Automated surveillance systems rapidly detect unusual and critical situations in crowded environments and thereby assist in making adequate decisions for safety and emergency control [5, 6]. Hence, surveillance systems are essential in complex and crowded environments
like busy streets, political rallies, airports, train stations, and shopping malls, to automatically detect and control the escape or panic behavior caused by riots or chaotic acts, stampedes, pushing, and violent events, for public safety, security, and statistical purposes [7, 8].
To supervise, protect, and control crowds, density estimation and tracking is a crucial video-frame analysis process, as it provides basic descriptions of crowd status [9, 10]. However, counting and tracking in crowded scenes is a challenging problem because of instantaneous illumination changes, different outlooks and behaviors, partial or full occlusions, complicated backgrounds, indoor and outdoor scenes, and the fact that as the crowd grows, the pixels per human decrease [11, 12]. On the other hand, challenges faced in crowd behavior detection involve low resolution with dynamic backgrounds, modeling the crowd behavior, occlusion between individuals, and random variations of a crowd
[13, 14]. Hence, accurate crowd behavior detection across diverse video scenes requires the extraction of robust descriptors that provide significant information about motion and scene changes, together with a strong decision-making classifier [15, 16].
In this research article, we propose a new robust approach for pedestrian crowd density estimation, tracking, and anomaly detection. We begin with pre-processing steps. For foreground extraction, we introduce semantic segmentation by labeling and clustering the pixels belonging to the same class. After foreground extraction, our work involves two facets: (i) pedestrian crowd counting and tracking, and (ii) crowd anomaly detection. For pedestrian counting and tracking, we first verify the extracted silhouettes by introducing a weighted averaging fusion of HMA and AFM, which excludes non-humans and non-pedestrians from the scene. We then use a fuzzy-c-means algorithm to count the pedestrians, the Hungarian algorithm for association, and dynamic template matching for tracking. For anomaly detection, we first extract new robust spatio-temporal descriptors, including crowd shape deformation, dominant motion, silhouette slicing, energy, and particle convection descriptors, which we optimize using an adaptive genetic algorithm. Lastly, the optimized multi-fused distinguishable descriptors are passed through a multilayer neuro-fuzzy classifier for anomaly detection.
The major contributions and highlights presented in this
paper are summarized as follows.
1. We propose a robust semantic segmentation approach for foreground extraction, a necessary step for crowd estimation, tracking, and behavior analysis in crowded scenes.
2. A fusion of AFM and HMA via a weighted averaging process is introduced for the removal of non-humans and non-pedestrians from the scene.
3. A clustering method using a fuzzy-c-means algorithm is used for density estimation; a particle-based measure is also introduced for inferring the number of pedestrians in each cluster.
4. Multi-scale descriptors are introduced, namely crowd shape deformation, dominant motion, energy-based, silhouette slicing, and particle convection descriptors, for anomaly detection.
5. A multilayer neuro-fuzzy classifier is used to make the decision based on multi-fused optimized descriptors for anomaly detection. A comparative analysis is carried out on four publicly available benchmark datasets: UCSD and Mall for crowd counting and tracking, and UMN and MED for anomaly detection.
The rest of the article is arranged as follows: Sect. 2
succinctly reviews other state-of-the-art methods. Section 3
describes the methodology of our proposed system. Per-
formance evaluation of our proposed approach on four
benchmark datasets plus a comparison analysis and discus-
sion is given in Sect. 4. Finally, in Sect. 5, we conclude the
paper and outline the future directions.
2 Related Work
In recent years, different computer vision approaches have
been proposed by researchers for crowd density estimation,
tracking, and anomaly detection [17,18]. We divide the
related work into two subsections, the first section describes
crowd density estimation and tracking systems; however, the
second subsection describes crowd anomaly detection sys-
tems.
2.1 Crowd Density Estimation and Tracking
Various researchers have employed different models to track
and estimate crowd density [19, 20]. Table 1 presents a summary of research works relevant to these models.
2.2 Crowd Anomaly Detection Systems
Numerous researchers have devoted their energies in devel-
oping systems for anomaly detection using different methods
[27, 28]. Table 2 shows a detailed summary of research works relevant to these models.
3 Proposed System Framework
In this paper, we introduce a robust semantic segmentation based pedestrian tracking and anomaly detection system. In our proposed system, we initially apply pre-processing steps, then deploy semantic segmentation for multiple object detection and extract human-resembling silhouettes. After that, we segregate our work into two facets. In the first, for crowd counting and tracking, we verify the extracted silhouettes by introducing a weighted average fusion of the attraction force model and human motion analysis. Next, crowd counting is performed using a fuzzy-c-means algorithm, and for crowd tracking we use Hungarian algorithm association and a dynamic template matching technique. In the second facet, for anomaly detection, after obtaining human silhouettes we extract spatio-temporal descriptors that are optimized using an adaptive genetic algorithm. These optimized features are then fed to a multilayer neuro-fuzzy classifier for anomaly detection. Figure 1 depicts the synoptic schematics of our proposed system. The details of each of the aforementioned modules are explained in the following subsections.
Table 1 Crowd density estimation and tracking systems

Zhang et al. [21]
Methodology: For human detection, a fusion of a cascade boosted classifier and rectangle features was used to train a multi-scale head-shoulder detector; human tracking was then used to eliminate duplicates and count the pedestrians.
Highlights and limitations: The system has certain misclassifications due to the similar randomization of different classes.

Khan et al. [22]
Methodology: A human head detection and localization method was used for crowd counting. Scale-map-based, scale-aware head proposals were generated and passed to a CNN for head probabilities, which are then confirmed and added to the count using non-maximal suppression.
Highlights and limitations: Non-maximal suppression with a fixed threshold was used to get the precise location of heads; localization performance changes as the threshold value changes, which limits the system.

Gochoo et al. [23]
Methodology: Developed Hough circular gradient transforms for head detection and a HOG-based symmetry technique for shoulder detection; detected heads are verified by a 1D CNN and then counted using a cross-line judgment technique.
Highlights and limitations: Accuracy is limited under illumination changes and in complex crowded scenes; especially in queue conditions, the system produced some misdetections of heads due to overlaps.

Pervaiz et al. [24]
Methodology: A template matching method was used for human verification. People counting was then performed by distributing multiple particles on humans to extract particle flows, which are clustered by a self-organizing map.
Highlights and limitations: Accuracy decreases in dense crowds, as the model cannot detect human silhouettes that are partially or fully occluded by other objects for an extended period of time.

Merad et al. [25]
Methodology: Based on the association of two modules: a tracking module, and an association module that recovers the global trajectories of tracked individuals in a multiple-target tracking system.
Highlights and limitations: Not effective for arbitrary movements and overlaps; the k-nearest-neighbor re-identification strategy's accuracy varies with the number k of chosen neighbors.

Chahyati et al. [26]
Methodology: Multiple target objects are detected using RetinaNet, and the Hungarian algorithm is then used for tracking.
Highlights and limitations: Detection accuracy degrades in complex crowds, which limits the system.
3.1 Pre-Processing
During pre-processing, videos from a static video camera are first converted into color frames [f_1, f_2, f_3, ..., f_N], where N is the total number of frames. Each colored frame is then passed through an Adaptive Median Filter (AMF) to effectively remove noise and distortion and to provide smoothing while preserving edges. AMF works in two stages: it compares each pixel in the image to its neighboring pixels and classifies pixels as noise by performing spatial processing. In AMF, pixels are labeled as impulse noise when they are not structurally aligned with the pixels they resemble and differ from a majority of their neighbors. The threshold for the comparison as well as the size of the neighborhood is adjustable. Noisy pixels are replaced by the median value of the neighborhood pixels that have passed the noise-labeling test. After AMF, histogram equalization is performed on the filtered image to adjust its contrast using Eq. (1):
$$s_k = T(r_k) = (L-1)\sum_{j=0}^{k} p_r(r_j) \qquad (1)$$
where k = 0, 1, 2, ..., (L-1) and the variable r denotes the intensities of the input image to be processed. As usual, we assume that r is in the range [0, L-1], with r = 0 representing black and r = L-1 representing white, while s represents the output intensity level after intensity mapping for every pixel of intensity r in the input image. Here, p_r(r) is the probability density function (PDF) of r, the subscript on p indicating that it is the PDF of r. Thus, a processed (output) image is obtained via Eq. (1) by mapping each pixel with intensity r_k in the input image into a corresponding pixel with level s_k in the output image, as shown in Fig. 2.
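For illustration, the following is a minimal Python sketch of this pre-processing stage: a simple adaptive median filter followed by histogram equalization per Eq. (1). The maximum window size and the input file name are assumptions for illustration, not values from our experiments.

```python
import cv2
import numpy as np

def adaptive_median(gray, max_win=7):
    """Replace impulse-noise pixels with the local median, growing the
    window until the median is itself not an extreme value."""
    out = gray.copy()
    pad = max_win // 2
    padded = np.pad(gray, pad, mode="edge")
    h, w = gray.shape
    for y in range(h):
        for x in range(w):
            for win in range(3, max_win + 1, 2):
                r = win // 2
                patch = padded[y + pad - r:y + pad + r + 1,
                               x + pad - r:x + pad + r + 1]
                zmin, zmed, zmax = patch.min(), np.median(patch), patch.max()
                if zmin < zmed < zmax:               # median is not impulse noise
                    if not (zmin < gray[y, x] < zmax):
                        out[y, x] = int(zmed)        # replace noisy pixel
                    break                            # keep original otherwise
    return out

frame = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)  # hypothetical path
filtered = adaptive_median(frame)
enhanced = cv2.equalizeHist(filtered)  # s_k = (L-1) * cumulative PDF, Eq. (1)
```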
3.2 Semantic Segmentation
After the pre-processing phase, we deploy semantic segmentation [35-38] for multi-object detection. An image is a collection of pixels; in semantic segmentation (SS) we classify every pixel of an image into a particular label class, resulting in an image segmented by class. SS is used to recognize collections of pixels that form distinct categories. We apply a deep learning algorithm for semantic segmentation using an encoder-decoder structure. Our encoder-decoder network consists of an encoder module that gradually reduces the feature maps and captures higher semantic information, and a decoder module that refines the segmentation results along object boundaries. Figure 3 shows our encoder-decoder structure with atrous convolution for
Table 2 Crowd anomaly detection systems

Zhang et al. [29]
Methodology: Anomalies are detected via a fluid force model and scene perception. Fluid features and appearance features are extracted, and the final decision is taken by a one-class SVM on the basis of the extracted features.
Highlights and limitations: The model was not effective across different scenes; the adaptability of the method to different cases needs further improvement.

Karim et al. [30]
Methodology: Anomaly is detected on the basis of outlier rejection. Superpixels whose direction of motion does not conform with the dominant motion are rejected, and the extracted features, with a modified k-means classification algorithm, are used for anomaly detection.
Highlights and limitations: The hand-crafted features produced false-positive results in complex crowds; fusing them with a deep neural network is expected to further improve performance.

Shehzad et al. [31]
Methodology: Used the Jaccard similarity index and a template matching technique for multi-people tracking; Gaussian clusters were introduced for abnormal event detection.
Highlights and limitations: Humans are detected via a thresholding technique, which degrades accuracy under occlusions and illumination variations.

Yimin et al. [32]
Methodology: Detects anomalies based on the optical-flow trajectory of the joint points of every human body; features are extracted using the trajectory constraints, and the final decision is taken by an SVM.
Highlights and limitations: Accuracy is seriously reduced in large-scale crowds with inevitable overlaps and occlusions, as it is difficult to obtain an accurate trajectory in a complex crowd.

Nawaratne et al. [33]
Methodology: Developed an incremental spatio-temporal learner model utilizing active learning, which temporally updates on anomalies using convolution layers that learn spatial regularities and ConvLSTM layers that learn temporal regularities.
Highlights and limitations: May produce some false-negative detections for re-occurring anomalies; also requires a large dataset and considerable training time.

Chen et al. [34]
Methodology: A motion energy model was introduced that detects anomalies by considering the sum-of-squared-differences metric of motion information in the center block and its neighboring blocks against a preset threshold.
Highlights and limitations: The fixed block size for calculating the motion energy value limits accuracy; detecting an anomaly whenever the motion energy exceeds a preset threshold is not effective in all cases.
semantic segmentation. We use the atrous spatial pyramid pooling module as an encoder that applies atrous convolution with different rates to probe convolutional features at multiple scales. Atrous convolution allows us to extract features computed by a DCNN at an arbitrary resolution and to adjust the filter field-of-view to capture multi-scale contextual information. At the decoder end, we first upsample the features bilinearly by a factor of 4 and then concatenate the low-level features. Furthermore, for smooth training and to increase the importance of the encoder features, we reduce the channels by applying a 1 × 1 convolution on the low-level features. Finally, we again upsample bilinearly by a factor of 4 after refining the features with a 3 × 3 convolution. Figure 4 shows semantic segmentation results for different random views.
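As an illustration of this step, the sketch below uses torchvision's pretrained DeepLabv3 (an atrous-convolution, ASPP-based encoder-decoder) as a stand-in; the exact architecture and training configuration of our network are not reproduced here.

```python
import torch
import torchvision
from torchvision import transforms

model = torchvision.models.segmentation.deeplabv3_resnet50(pretrained=True).eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def segment(frame_rgb):
    """Return a per-pixel class-label map (H x W) for one RGB frame."""
    inp = preprocess(frame_rgb).unsqueeze(0)        # 1 x 3 x H x W
    with torch.no_grad():
        logits = model(inp)["out"]                  # 1 x C x H x W
    return logits.argmax(dim=1).squeeze(0).numpy()  # class index per pixel
```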
3.3 Silhouettes Extraction
After multi-object detection through semantic segmentation, we extract only those pixels that belong to the human class, as every pixel is labeled and assigned to its particular class in SS. All pixels other than the human class are set to zero, since we are interested in human silhouettes only. After extracting the human-class pixels, we convert the image into a binary image using Eq. (2):
$$bw(x,y) = \begin{cases} 1 & \text{if } I(x,y) > 0 \\ 0 & \text{if } I(x,y) = 0 \end{cases} \qquad (2)$$
where I is the image with only human-class pixels and bw is the resulting binary image containing only human silhouettes, as shown in Fig. 5b.
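A minimal sketch of Eq. (2) follows; the person class index (15, the Pascal-VOC convention used by the torchvision model sketched above) is an assumption that depends on the label map in use.

```python
import numpy as np

PERSON_ID = 15  # assumption: Pascal-VOC "person" class index

def human_silhouettes(label_map):
    human = (label_map == PERSON_ID)              # zero out non-human classes
    return np.where(human, 1, 0).astype(np.uint8) # bw(x, y) per Eq. (2)
```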
3.4 Crowd Density Estimation and Tracking
An authentic pedestrian crowd tracking system requires extraction of the true foreground, representing human pedestrians only. Hence, for scrupulous pedestrian tracking, after semantic segmentation we perform the human silhouette verification step, and then the pedestrian counting and tracking steps are executed.
3.4.1 Pedestrian Human Silhouettes Verification
For human silhouette verification, we introduce a robust Human Motion Analysis (HMA) and Attraction Force Model (AFM).
Fig. 1 Synoptic schematics of the proposed pedestrian tracking and crowd anomaly detection system
Fig. 2 Pre-processing steps: (a) filtered image using AMF, (b) histogram of the filtered image, (c) histogram of the enhanced image, and (d) enhanced image
We eliminate all objects other than human pedestrians using a fusion of HMA and AFM via the weighted averaging method for accurate and strict pedestrian tracking.
Fig. 3 Encoder-decoder structure with atrous convolution for semantic segmentation
Fig. 4 Results of semantic segmentation: (a, b) over UCSD pedestrian dataset, and (c) over UMN dataset
Fig. 5 Foreground extraction: (a) human-class pixels only, (b) binary view of the extracted human class
Fig. 6 Human motion analysis, determining angles from a skeleton, where x_c and y_c are the spatial locations of the center of the skeleton, corresponding to the hip position in the vertical and horizontal directions, respectively

In HMA, to distinguish pedestrians from other objects, e.g., bicyclists and bikes, we determine the internal motion of a moving silhouette over time using the star-skeleton strategy. To produce the star skeleton on the extracted silhouettes, we first find the centroid and then connect the center of each silhouette by a line to three extremal points recovered by traversing the boundaries. The three extremal points usually represent the torso and two legs, taking the uppermost and lowermost extremal points, as human motion is normally in an upright position. Pedestrians exhibit periodic motion when moving, whereas other objects, e.g., bicyclists, exhibit rotational motion. Hence, to analyze the motion, we measure the angle α between the vertical and the uppermost extremal point, the angle β between the two lowermost extremal points, and the angle γ between the end locations of the two extremal points, measuring the variation of movement over time in 2D space corresponding to the ankles, as depicted in Fig. 6. We thus distinguish pedestrians from other objects by analyzing human motion via the variation of these three angles over time, as depicted in Fig. 7.
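The following sketch illustrates the star-skeleton angle computation; approximating the three extremal points as the boundary points farthest from the centroid is a simplifying assumption for illustration.

```python
import cv2
import numpy as np

def star_skeleton_angles(silhouette_mask):
    """Return (alpha, beta, gamma) for one binary silhouette mask."""
    cnts, _ = cv2.findContours(silhouette_mask, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_NONE)
    pts = cnts[0].reshape(-1, 2).astype(float)
    c = pts.mean(axis=0)                        # centroid (xc, yc)
    d = np.linalg.norm(pts - c, axis=1)
    top = pts[np.argsort(d)[-3:]]               # three extremal boundary points
    top = top[np.argsort(top[:, 1])]            # sort by y: head first, legs last
    head, leg1, leg2 = top
    alpha = np.arctan2(head[0] - c[0], c[1] - head[1])      # head vs. vertical
    beta = (np.arctan2(*(leg1 - c)[::-1]) -
            np.arctan2(*(leg2 - c)[::-1]))                  # between the two legs
    gamma = np.arctan2(*(leg2 - leg1)[::-1])                # line between ankles
    return alpha, beta, gamma
```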
Fig. 7 Human motion analysis: (a) star skeleton projected onto the silhouettes over an extracted frame, (b) magnified view of motion analysis
Fig. 8 Attraction force model: (a) particle conversion of all extracted silhouettes, (b) magnified view of particle conversion
Fig. 9 Attraction force model: (a) attraction force between two particles, (b) magnified view of AF for a pedestrian, (c) magnified view of AF for a non-pedestrian
In AFM, we first convert each extracted silhouette into particles such that every silhouette is represented by a collection of particles R = [p_1, p_2, p_3, ..., p_Z], where Z represents the total number of particles in one silhouette, as shown in Fig. 8. From physics, we know that in solids, because of low kinetic energy, the particles cannot overcome the strong force of attraction (bonds) that pulls the particles toward each other. Using this concept, we treat every extracted pixel as a fluid particle and calculate the force of attraction between the particles of each extracted silhouette. To reduce computational complexity, we calculate the internal force of attraction between two mutually interacting particles using Eq. (3), as shown in Fig. 9.
$$F_i = \frac{p_1 p_2}{r^2} \qquad (3)$$
where F_i is the attraction force between p_1 and p_2 of the i-th silhouette, i is in the range [1, E] with E the maximum number of silhouettes per frame, and r² represents the squared distance between p_1 and p_2.
After the attraction force calculation, we discard silhouettes having static attraction force in a sequence of frames by utilizing Eq. (4):
$$H_s = \begin{cases} 1 & \text{if } \dfrac{dF_i}{dt} > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$
where dF_i/dt is the change in the force of attraction over time among the particles of every i-th silhouette between frames t and t + 1. We also eliminate objects whose attraction force is beyond a certain threshold, considering them non-pedestrians, i.e., bicyclists, bikes, etc.
In summary, this section verifies the silhouettes obtained through SS using a fusion of AFM and HMA via a weighted averaging process, eliminating non-humans and non-pedestrians so that pedestrians can be distinguished from other objects and accurately counted and tracked in crowded scenes.
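A minimal sketch of the AFM check of Eqs. (3)-(4): sample particle pairs on a silhouette, compute the pairwise force, and keep silhouettes whose force changes over time. Treating the particle "mass" as pixel intensity and the upper force threshold are assumptions for illustration.

```python
import numpy as np

def attraction_force(particles, intensities):
    """Mean pairwise attraction force for one silhouette's particles."""
    n = len(particles)
    forces = []
    for i in range(n):
        for j in range(i + 1, n):
            r2 = np.sum((particles[i] - particles[j]) ** 2) + 1e-6
            forces.append(intensities[i] * intensities[j] / r2)  # Eq. (3)
    return np.mean(forces)

def is_pedestrian(force_t, force_t1, f_max=1e4):
    dF = force_t1 - force_t
    return dF > 0 and force_t1 < f_max   # Eq. (4) plus the upper threshold
```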
3.4.2 People Counting
After human pedestrian verification, we perform cluster estimation to count the detected human silhouettes using a fuzzy-c-means algorithm. As each silhouette consists of a collection of particles forming a cluster, we first label these clusters in each frame using Eq. (5):
$$L_c = I_m(p_k) \qquad (5)$$
where I_m is the label of cluster m and p_k is the total count of particles in one cluster; L_c is the resulting labeled cluster, which is considered one silhouette and mediated in counting. We also draw bounding boxes around every cluster to make them visually apparent; hence, using labeling and cluster estimation, we count all verified human pedestrians, as depicted in Fig. 10. Note that the number of clusters varies from frame to frame, and the number of particles in each cluster varies from cluster to cluster.
Fig. 10 Pedestrian crowd counting results over different time intervals
Fig. 11 Pedestrian crowd counting results, inferring the number of pedestrians in each cluster, on (a) Mall dataset and (b) UCSD dataset
Inferring the Number of People in Each Cluster: Optimally, the clustering process above depicts each pedestrian in the scene with one distinct cluster, so the count of pedestrians equals the count of clusters. In reality, this is not always the case: because of occlusion, pedestrians close to each other can be clustered together, so the cluster count itself can be misleading. We therefore propose a particle-based measure, since each extracted silhouette has already been converted into particles. In practice, a single pedestrian usually consists of a specific number of particles. We first measure the minimum number of particles required for a single pedestrian and then use Eq. (6) to infer the number of pedestrians in each cluster:
$$H_k = \frac{p_k}{\bar{p}_{nk}} \qquad (6)$$
where H_k represents the total number of pedestrians in cluster k, p_k is the total number of particles in cluster k, and \bar{p}_{nk} is the average number of particles required for a single pedestrian. Figure 11 shows the pedestrian counts under occlusion, with two humans present in one cluster.
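A compact sketch of the counting step: a small fuzzy-c-means implementation over particle coordinates, with Eq. (6) applied per cluster. The cluster number c and the average particles-per-pedestrian value are assumptions, not calibrated values.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100):
    """Tiny FCM: X is N x 2 particle coordinates; returns memberships U (c x N)."""
    N = len(X)
    U = np.random.dirichlet(np.ones(c), size=N).T       # random fuzzy partition
    for _ in range(iters):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-9
        U = 1.0 / (d ** (2 / (m - 1)))                  # standard FCM update
        U /= U.sum(axis=0, keepdims=True)
    return U

def count_pedestrians(particles, c, avg_particles_per_person=120):
    U = fuzzy_c_means(particles, c)
    labels = U.argmax(axis=0)
    total = 0
    for k in range(c):
        p_k = np.sum(labels == k)                       # particles in cluster k
        total += max(1, round(p_k / avg_particles_per_person))  # Eq. (6)
    return total
```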
3.4.3 Pedestrian Association
After pedestrian counting, our next goal is to track these pedestrians. For this purpose, we use the Hungarian Algorithm (HA) to associate people from one frame to another based on a score [39, 40], maintaining the identity of each pedestrian in the scene. In our work, we define the score as a combination of two matrices: the bounding-box center distance and the IoU of the bounding boxes. The association algorithm consists of the following steps:
1. Establish two empty lists: a tracking list (t−1) and a detection list (t).
2. Calculate the center distance and IoU score and store them in a matrix by going through the detection and tracking lists; a cost function is used to prioritize each score.
3. Run the Hungarian algorithm, which searches for the minimum tracking value for each detection in the matrix using bipartite-graph logic, yielding a matrix that depicts the matching between detections and trackers.
4. In cases of complex occlusion where bounding boxes overlap, giving two or more matches for a single candidate, set the maximum IoU to 1 and all remaining entries to 0. Also, since we have a score rather than a cost, we replace each 1 with −1 so that the minimum-value search applies.
5. Missing values in the Hungarian matrix are unmatched detections and unmatched trackers.
6. Unmatched trackers live for 5 more frames awaiting re-association; otherwise they are removed. For unmatched detections, we initialize a new tracker for 3 frames; if it persists, it becomes active, otherwise it is removed.
The result of pedestrian association is a set of trackers, each associated with a detection, which becomes the input for the tracking stage.
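For illustration, the association step can be sketched with SciPy's Hungarian solver; the relative weighting of the IoU and center-distance scores is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / area if area > 0 else 0.0

def associate(tracks, detections, w_iou=0.7, w_dist=0.3):
    cost = np.zeros((len(tracks), len(detections)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(detections):
            ct = ((t[0]+t[2])/2, (t[1]+t[3])/2)
            cd = ((d[0]+d[2])/2, (d[1]+d[3])/2)
            dist = np.hypot(ct[0]-cd[0], ct[1]-cd[1])
            # score -> cost: negate the IoU so the solver minimizes
            cost[i, j] = -w_iou * iou(t, d) + w_dist * dist / 100.0
    rows, cols = linear_sum_assignment(cost)   # optimal track/detection pairs
    return list(zip(rows, cols))
```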
3.4.4 Pedestrian Tracking
The goal of our pedestrian tracking model is to establish the trajectories of all pedestrians in a scene through association and position matching in each frame of the video using a dynamic template matching approach. The problem is to determine whether, and where, there is an occurrence (or at least a sufficiently similar occurrence) of the template in the target image. For this purpose, we use a correlation coefficient as a measure of similarity between the reference (template) and each location (x, y) in the target image; the result is maximal at locations where the template corresponds pixel by pixel to the sub-image located at (x, y).
Our tracking module gets position information from the detection module and associates pedestrians using HA. This knowledge is then utilized to extract a reference image from the previous frame whenever required, and the module flashes the same rectangle over the pedestrian if it is found while searching the frames captured from that point onward; otherwise, a new template is generated. Since the templates for searching are generated dynamically, this constitutes a dynamic template matching algorithm. Hence, using data association, predicted and detected pedestrians are associated and tracked using normalized cross-correlation as a cost function and dynamic template matching. In our tracking approach, we represent each pedestrian as a rectangle with its corresponding ID at the bottom right, as depicted in Fig. 12.
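A minimal sketch of one dynamic template matching step using OpenCV's normalized cross-correlation; the re-matching threshold is an assumed value.

```python
import cv2

def track_step(frame_gray, template, threshold=0.6):
    """Find the template in the frame; return (box, score) or (None, score)."""
    res = cv2.matchTemplate(frame_gray, template, cv2.TM_CCOEFF_NORMED)
    _, score, _, top_left = cv2.minMaxLoc(res)  # best correlation and location
    if score < threshold:
        return None, score       # lost: the caller cuts a fresh template
    h, w = template.shape
    x, y = top_left
    return (x, y, x + w, y + h), score
```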
Fig. 12 Sample frames of crowd tracking results at different time intervals over UCSD dataset
3.5 Anomaly Detection
Categorizing interactions within crowds as normal or abnormal requires the extraction of robust spatio-temporal descriptors along with a strong classifier. Hence, for accurate anomaly detection, after applying semantic segmentation (mentioned in Sect. 3.3), the extracted silhouettes are passed through the descriptor extraction step, and the descriptors are optimized using AGA. Finally, the decision is made by the multilayer neuro-fuzzy classifier.
3.5.1 Descriptor Extraction
In this paper, we introduce multiple distinguishable hybrid spatio-temporal descriptors, including crowd shape deformation, silhouette slicing, particle convection, energy, and dominant motion descriptors.
In the crowd shape deformation descriptor, we analyze the deformation of the crowd shape over time. We first extract the crowd contour by computing the centers of all pedestrians present in the scene and connecting the centers of those pedestrians at the corners of the frame, forming the largest convex hull, which represents the crowd contour, as depicted in Fig. 13. After crowd contour extraction, we compare the contours of two consecutive frames using the normalized moment of the contour to compute the changes in crowd shape. Hence, we integrate over all points of the contour and compute the central moment, as expressed in Eq. (7):
$$C_{r,s} = \sum_x^n \sum_y^n A(x,y)\,(x - x_{avg})^r (y - y_{avg})^s \qquad (7)$$
Fig. 13 Crowd shape deformation descriptors: (a) normal frame, (b) abnormal frame
where r and s represent the powers to which x and y are taken in the sum, while A(x, y) denotes the intensity of the pixel at coordinate (x, y). The summation is over all points on the contour boundary. The computed moment is like a central moment except that x_avg and y_avg are average values. For the simple (r, s) moment, if r and s are both 0, then C_{0,0} is just the length of the contour in points. To compute the normalized moment, we divide by an appropriate power of C_{0,0}, as expressed in Eq. (8):
$$N_{r,s} = \frac{C_{r,s}}{C_{0,0}^{(r+s)/2+1}} \qquad (8)$$
where N_{r,s} represents the normalized moment. This normalized moment of the contour is used to compare the contours of consecutive frames to analyze variations in crowd shape.
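A sketch of this descriptor under the simplifying assumption of unit pixel intensity along the contour; the moment orders compared are illustrative choices.

```python
import cv2
import numpy as np

def crowd_contour(centers):
    """Convex hull over all pedestrian centers, the crowd contour."""
    pts = np.asarray(centers, dtype=np.float32)
    return cv2.convexHull(pts).reshape(-1, 2)

def normalized_moment(contour, r, s):
    x, y = contour[:, 0], contour[:, 1]
    xa, ya = x.mean(), y.mean()
    c_rs = np.sum((x - xa) ** r * (y - ya) ** s)   # Eq. (7), unit intensity
    c_00 = len(contour)                            # C00: contour length in points
    return c_rs / c_00 ** ((r + s) / 2 + 1)        # Eq. (8)

def shape_change(cnt_t, cnt_t1, orders=((2, 0), (0, 2), (1, 1))):
    """Compare two consecutive crowd contours via normalized moments."""
    return sum(abs(normalized_moment(cnt_t, r, s) -
                   normalized_moment(cnt_t1, r, s)) for r, s in orders)
```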
We also introduce new robust silhouette slicing descriptors in the time domain. First, we create patches (slices) of all pedestrian silhouettes present in the scene by finding their centers and joining each center to the two vertical extreme points with a line segment; patches are then created such that their centers lie on the line segment pertinent to the body of each pedestrian, so each pedestrian consists of a set of slices/patches, as shown in Fig. 14a.
Fig. 14 Silhouette slicing descriptors: (a) silhouette slicing, (b) and (c) selected histograms of random slices, (d) unselected histogram of a random slice
Fig. 15 Particles convection descriptors: (a) PCD for all human silhouettes, (b) magnified view of PCD for two silhouettes
The total number of slices per pedestrian is a user choice; in our experimentation we fixed the count to six, as it shows good results. After slicing, we compute the motion changes over time by first labeling each pedestrian and automatically choosing the four best slices. For slice selection, we compute the histogram of each slice by converting RGB to HSI; the four slices of each pedestrian whose histograms are most equalized compared to the others are selected automatically. The selected slices are then matched with the corresponding candidate slices of the same labeled human in the next frame to compute the motion changes. For slice matching, we take only the I channel of the HSI histogram and use the Hellinger distance between the candidate histogram and the model histogram, as expressed in Eq. (9):
$$HD\left(h_i^m, h_i^c\right) = \sqrt{\frac{1}{2}\sum_{i=1}^{k}\left(\sqrt{h_i^m} - \sqrt{h_i^c}\right)^2} \qquad (9)$$
where h_i^m and h_i^c represent the probability distributions of the model and candidate histograms, respectively. The model slice in one frame is matched to the candidate slice in the subsequent frame that presents the smallest Hellinger distance.
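A minimal sketch of the slice-matching step with the Hellinger distance of Eq. (9), assuming the histograms are already normalized to probability distributions.

```python
import numpy as np

def hellinger(h_model, h_cand):
    """Hellinger distance between two normalized histograms, Eq. (9)."""
    return np.sqrt(0.5 * np.sum((np.sqrt(h_model) - np.sqrt(h_cand)) ** 2))

def match_slice(model_hist, candidate_hists):
    """Return the index of the candidate slice with the smallest distance."""
    d = [hellinger(model_hist, h) for h in candidate_hists]
    return int(np.argmin(d))
```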
In the particle convection descriptor, we convert each silhouette present in the scene into particles such that every silhouette is represented by a collection of particles, and we estimate the interaction force between particles. In the computation of the interaction force, we consider only those particles that lie on the silhouette contour, as shown in Fig. 15. Generally, in crowded scenes, pedestrians have certain goals and destinations and thus a desired velocity, calculated using bilinear interpolation of the neighboring flow-field vectors via Eq. (10):
$$M_j^c = (1 - w_j)\,V(x_j, y_j) + w_j\,V_{mean}(x_j, y_j) \qquad (10)$$
where M_j^c is the desired velocity, V(x_j, y_j) represents the optical flow of particle j, and V_{mean}(x_j, y_j) is the mean optical flow for particle j at coordinate (x_j, y_j); w_j is the panic weight parameter, with w_j = 0 indicating the individual motion of pedestrian j and w_j = 1 indicating group motion of pedestrian j. In reality, due to the presence of other pedestrians and obstacles in a crowd, the actual motion of a pedestrian differs from the desired motion and is calculated as in Eq. (11):
$$M_j = V_{mean}(x_j, y_j) \qquad (11)$$
where M_j is the actual motion of pedestrian j. The interaction force of a particle with the environment and its neighboring particles can be calculated from the difference between the actual velocity of the particle and its desired velocity. Hence, utilizing the actual and desired motion calculated above, we compute the interaction force using Eq. (12):
$$I_F = \frac{1}{\tau}\left(M_j^c - M_j\right) - \frac{dM_j}{dt} \qquad (12)$$
where I_F represents the resultant interaction force, τ is the relaxation parameter, and dM_j/dt is the change in the motion of pedestrians over time. Using this interaction force, we observe the pedestrians' motion dynamics.
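A sketch of Eqs. (10)-(12); the panic weight w and relaxation parameter τ are assumed constants here, and the velocities are taken as per-particle 2D arrays.

```python
import numpy as np

def interaction_force(v, v_mean, v_prev, w=0.5, tau=0.5, dt=1.0):
    """v, v_mean, v_prev: per-particle 2D velocity arrays (N x 2)."""
    desired = (1 - w) * v + w * v_mean         # Eq. (10): desired velocity
    actual = v_mean                            # Eq. (11): actual motion
    dM_dt = (actual - v_prev) / dt             # temporal change in motion
    return (desired - actual) / tau - dM_dt    # Eq. (12): interaction force
```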
In the dominant motion descriptor [41, 42], we extract trajectories using a set of feature point tracks expressed as
$$\left\{\left(x_d^j, y_d^j\right),\; d = D_{strt}^j, \ldots, D_{fnl}^j\right\}, \quad j = 1, \ldots, N \qquad (13)$$
where N is the total count of point tracks. We use the Lucas-Kanade tracker to track these feature points. Our purpose is to cluster these point tracks into dominant patterns of motion. Some point tracks cover only a small portion of each pedestrian's motion; for dominant motion trajectories, we use point tracks with protracted trajectories through the scene along which a large count of feature point tracks exists. We cluster point tracks that have the same direction of motion and are spatially close to each other. For this purpose, we use distance metrics to compare point tracks based on the longest common subsequence. We sample new points after every eighth frame and add them to the tracker, since new pedestrians enter the videos over time. A new cluster is established if a track is found that is adequately long and distinct from every existing cluster center. We update a cluster's center using a least-squares polynomial fit if its size exceeds a fixed value. Finally, to attain the dominant motion path, we merge clusters if the similarity of their centers exceeds 50%. Steps for the dominant motion descriptor are given in Algorithm 1.
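Since Algorithm 1 is only summarized above, the following sketch illustrates the point-track extension with Lucas-Kanade and a greedy direction/proximity clustering; the thresholds are illustrative assumptions, and the longest-common-subsequence metric is replaced by a simpler endpoint comparison for brevity.

```python
import cv2
import numpy as np

lk = dict(winSize=(15, 15), maxLevel=2)

def extend_tracks(prev_gray, gray, tracks):
    """tracks: list of point lists; append each point's new LK position."""
    pts = np.float32([t[-1] for t in tracks]).reshape(-1, 1, 2)
    nxt, ok, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None, **lk)
    for t, p, good in zip(tracks, nxt.reshape(-1, 2), ok.ravel()):
        if good:
            t.append(tuple(p))
    return tracks

def cluster_tracks(tracks, dir_thresh=0.5, dist_thresh=30.0):
    clusters = []
    for t in tracks:
        d = np.subtract(t[-1], t[0])           # overall direction of the track
        d = d / (np.linalg.norm(d) + 1e-9)
        for c in clusters:
            if (np.dot(d, c["dir"]) > dir_thresh and
                    np.hypot(*np.subtract(t[-1], c["end"])) < dist_thresh):
                c["members"].append(t)         # same direction, spatially close
                break
        else:
            clusters.append({"dir": d, "end": t[-1], "members": [t]})
    return clusters
```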
In the energy descriptor, we observe changes in the energy-level distribution of all silhouettes present in the scene. For this, we store the movements of human body parts as energy maps in an energy-based matrix with values between 0 and 8000. After distributing the energy values, we extract the energy index values into a 1D array. The mathematical representation of the energy distribution is expressed in Eq. (14):
$$E_i = \sum_{v=0}^{n} Id_R(v) \qquad (14)$$
where E_i represents the 1D array, v is the index number, and Id_R denotes the RGB values at v. The energy distribution of some random frames of the UMN dataset is shown in Fig. 16. We analyze the variation of the distribution temporally and use a heuristic thresholding technique to identify scenes where the energy index values exceed the threshold (Fig. 17).
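A sketch of the energy descriptor; the decay factor and the abnormality threshold are assumptions, with the map capped at the 0-8000 value range mentioned above.

```python
import numpy as np

class EnergyMap:
    def __init__(self, shape, decay=0.9, cap=8000):
        self.E = np.zeros(shape, dtype=np.float32)
        self.decay, self.cap = decay, cap

    def update(self, silhouette_mask):
        """Accumulate silhouette motion into the map; return the 1D array E_i."""
        self.E = np.clip(self.decay * self.E + silhouette_mask * 255.0,
                         0, self.cap)          # values kept in [0, 8000]
        return self.E.ravel()

def is_abnormal(energy_1d, threshold=2.5e6):
    return energy_1d.sum() > threshold         # heuristic thresholding
```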
Fig. 16 Energy descriptors: (a) normal frame, (b) abnormal frame
Fig. 17 AGA optimization: (a) normal optimal descriptors, (b) abnormal optimal descriptors
3.5.2 Event Optimization: Adaptive Genetic Algorithm
After descriptor extraction, the extracted descriptors are optimized using an Adaptive Genetic Algorithm (AGA) [43, 44]. Descriptor vectors are mapped to their respective genes to convert them into equivalent chromosomes, such that each gene represents one descriptor in a chromosome. The set of these chromosomes is called the population. At first, we randomly create the population of chromosomes. The chromosomes of this population are then evaluated through a fitness function, which indicates how effective a chromosome is for an optimal solution. The chromosomes with maximum fitness value are selected, and genetic operations (crossover and mutation) are applied to the selected chromosomes to form new ones. Crossover is performed only for above-average fitness; the adaptive values are computed as expressed in Eq. (15):
$$O = \begin{cases} w_1 (H_{max} - H)/(H_{max} - H_{avg}), & H > H_{avg} \\ w_3, & H \le H_{avg} \end{cases} \qquad (15)$$
where H_max and H_avg are the maximum and average fitness of chromosomes in the active generation, computed as
$$H_{avg} = \frac{\sum_{j=1}^{h} H_j}{h}, \qquad H_{max} = \max\{H_1, H_2, H_3, \ldots, H_f\}$$
For mutation, we set the probability low and use Eq. (16):
$$M = \begin{cases} w_2 (H_{max} - H)/(H_{max} - H_{avg}), & H > H_{avg} \\ w_4, & H \le H_{avg} \end{cases} \qquad (16)$$
where w_1, w_2, w_3, and w_4 are weight parameters. If the current fitness value is higher than the previous one, we replace the previous one with the new value. The process finishes only when the stopping criteria are met. The purpose is to evolve these chromosomes by generating better ones until optimal descriptors are obtained. The working mechanism of AGA is depicted in Algorithm 2.
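A sketch of the adaptive operator rates of Eqs. (15)-(16): fitter-than-average chromosomes get lower crossover and mutation rates. The weight values are assumptions, and the fitness function depends on the downstream classifier.

```python
import numpy as np

def adaptive_rates(H, fitness, w1=0.9, w2=0.1, w3=0.9, w4=0.1):
    """H: fitness of one chromosome; fitness: population fitness array.
    Returns (crossover probability, mutation probability)."""
    h_max, h_avg = fitness.max(), fitness.mean()
    if H > h_avg:
        scale = (h_max - H) / (h_max - h_avg + 1e-9)
        return w1 * scale, w2 * scale   # Eq. (15) and Eq. (16), H > H_avg
    return w3, w4                       # below-average case
```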
3.5.3 Classifier: Multilayer Neuro-Fuzzy
The multi-fused optimized descriptors from AGA are fed to a multilayer Neuro-Fuzzy Classifier (NFC) for decision making. In the NFC, we fuse the learning ability of neural networks with the knowledge representation of fuzzy logic. This fusion makes the classifier less dependent on expert knowledge and more systematic. Our NFC consists of five layers: the input layer, fuzzification layer, fuzzy rules layer, output membership functions layer, and defuzzification layer. The role of each layer is as follows.
Layer 1 (input layer): the extracted optimized descriptors are fed to the multilayer NFC, which has two inputs x and y, where x represents the spatial descriptors (A1, A2) and y represents the temporal descriptors (B1, B2, B3).
Layer 2 (fuzzification layer): the crisp descriptor values are transformed into fuzzy values using membership functions, as depicted in Eq. (17):
$$O_i^2 = \mu_{f_i}(x), \quad i = 1, 2, 3 \qquad (17)$$
where O_i^2 represents the output of layer 2, and μ_{f_i}(x) can be any fuzzy membership function. In our work, we use a combination of three membership functions, namely R, triangular, and L, to create the intersection points for the respective NFC inputs; the regions under the membership functions are called fuzzy regions. Membership values that fall purely under the R fuzzy region are considered normal, and those that fall purely under the L region are predicted as abnormal.
Layer 3 (fuzzy rule layer): fuzzy rules are defined to evaluate the fuzzy values. A sample fuzzy rule is shown in Eq. (18):
$$\text{Rule } N: \text{IF } O_i^2 \text{ is } t_i^N, \text{ THEN } Z_N = w_i^3 \qquad (18)$$
where t_i^N is the threshold value and w_i^3 is the weight assigned by layer 3. In our experimentation, we used seven if-then fuzzy rules, selected because of their higher accuracy rate. Figure 18 shows the accuracy of our system with different numbers of rules.
Fig. 18 Accuracy of multilayer NFC with different numbers of fuzzy rules
Layer 4 (output membership function layer): provides the output membership values, which are then normalized for the next layer.
Layer 5 (defuzzification layer): defuzzification is done by performing the MAX operation. Figure 19 depicts the architecture of the NFC.
Fig. 19 Architecture of the proposed multilayer neuro-fuzzy classifier for anomaly detection
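A toy forward pass through the five layers described above; the membership-function breakpoints, rule weights, and rule wiring are placeholders for illustration, not the trained values.

```python
import numpy as np

def mf_L(x, a, b):  return np.clip((b - x) / (b - a), 0, 1)  # L-shaped
def mf_R(x, a, b):  return np.clip((x - a) / (b - a), 0, 1)  # R-shaped
def mf_tri(x, a, b, c):
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0)

def nfc_forward(x, rule_weights):
    # Layers 1-2: fuzzify one crisp descriptor value into three fuzzy values
    mu = np.array([mf_R(x, 0.0, 0.5), mf_tri(x, 0.2, 0.5, 0.8),
                   mf_L(x, 0.5, 1.0)])
    # Layer 3: each rule fires in proportion to its membership and weight
    firing = np.repeat(mu, len(rule_weights) // 3 + 1)[:len(rule_weights)]
    z = firing * rule_weights
    # Layer 4: normalize the output membership values
    z = z / (z.sum() + 1e-9)
    # Layer 5: MAX defuzzification -> normal (0) vs. abnormal (1)
    return int(np.argmax([z[:len(z)//2].max(), z[len(z)//2:].max()]))

decision = nfc_forward(0.73, rule_weights=np.ones(7))  # 7 if-then rules
```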
4 Experimental Setup and Results
This section elaborates the details of all experiments performed to validate the proposed model. All processing and experimentation were performed with MATLAB and Google Colab (Python) tools. The hardware was an Intel Core i5-6200U with 2.40 GHz processing power, 8 GB RAM, and a 2 GB dedicated Nvidia 920M graphics card, running x64-based Windows 10 Pro. We segregate the experiments into three parts. In the first part, we evaluate the performance of crowd counting and tracking using the UCSD and
Mall datasets. In the second part, we evaluate the performance of anomaly detection on the UMN and MED datasets. Lastly, in the third part, we compare our proposed model with other state-of-the-art methods. This section is further split into three subsections: dataset description, performance metrics with results, and discussion.
4.1 Datasets Description
For crowd tracking, we used two datasets, UCSD and Mall, while the UMN crowd and MED datasets are used for anomaly detection. Details of each dataset are given in the following subsections.
4.1.1 UCSD Pedestrian Dataset
The UCSD dataset [45] is a 2000-frame video dataset containing videos of pedestrians, captured from a stationary camera on UCSD walkways. The average crowd count for the UCSD dataset is 24.8 per frame. The videos were recorded at 10 frames per second with a resolution of 328 × 158.
4.1.2 Mall Pedestrian Dataset
The Mall pedestrian dataset [46] consists of 2000 frames of pedestrians, captured inside a shopping mall using a publicly accessible surveillance camera. The average crowd count for the Mall dataset is 31.2 per frame. The videos were recorded
Table 3 Measurements of MAE and MSE for pedestrian counting over UCSD pedestrian and Mall datasets

Model          | UCSD MAE | UCSD MSE | Mall MAE | Mall MSE
Proposed model | 1.69     | 2.09     | 2.57     | 4.34
Fig. 20 Confusion matrix for semantic segmentation accuracy over UCSD dataset (mean accuracy = 96.69%). Class abbreviations: car, sdk = sidewalk, hmn = human, bcy = bicycle, grass, sky, tree, sktb = skateboard, road
Fig. 21 Confusion matrix for semantic segmentation accuracy over UMN dataset (mean accuracy = 96.03%). Class abbreviations: car, sdk = sidewalk, hmn = human, bcy = bicycle, grass, sky, tree, sktb = skateboard, road
at fewer than 2 frames per second (fps), with a resolution of 640 × 480.
4.1.3 UMN Dataset
The UMN dataset [47] was captured at the University of Minnesota. The dataset contains 11 videos with three different scenes: one indoor and two outdoor. The indoor scene has six scenarios with a total of 4144 frames. The two outdoor scenes are the plaza scene, with three scenarios and 2142 frames, and the lawn scene, with two scenarios and 1453 frames. Each video in the UMN dataset starts with a normal scene and ends with sequences of abnormal crowd behavior.
4.1.4 MED Dataset
The MED dataset [48] consists of videos recorded using an immobile video camera elevated at height, with a resolution of 554 × 235. Each video in MED begins with a normal scene and terminates with abnormal ones, with crowd densities changing from sparse to very crowded.
4.2 Performance Metrics and Results
We used seven evaluation metrics to measure the performance of our proposed model. For evaluating the performance of crowd counting, we used two universal quantitative metrics, Mean Absolute Error (MAE) and Mean Square Error (MSE):
$$MAE = \frac{1}{T}\sum_{v=1}^{T} |G_v - H_v| \qquad (19)$$
$$MSE = \frac{1}{T}\sum_{v=1}^{T} (G_v - H_v)^2 \qquad (20)$$
where T is the total number of testing frames, while G_v and H_v are the predicted and ground-truth counts of pedestrians, respectively. The performance of pedestrian tracking and crowd anomaly detection was evaluated with five evaluation metrics: accuracy, precision, recall, F1 score, and the confusion matrix [49, 50].
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (21)$$
$$Precision = \frac{TP}{TP + FP} \qquad (22)$$
$$Recall = \frac{TP}{TP + FN} \qquad (23)$$
Table 4 Measurements of accuracy, recall, and F1 score for crowd tracking over UCSD pedestrian dataset

Sequence no. (frame 70) | Ground truth | TP | FN | FP | Accuracy | Recall | F1 score
4  | 9  | 9  | 0 | 0 | 1.000 | 1.000 | 1.000
9  | 14 | 13 | 1 | 0 | 0.928 | 0.928 | 0.962
16 | 17 | 15 | 2 | 0 | 0.882 | 0.882 | 0.937
24 | 21 | 19 | 2 | 0 | 0.904 | 0.904 | 0.949
30 | 25 | 22 | 3 | 0 | 0.880 | 0.880 | 0.936
Mean accuracy: 91.8%
Table 5 Measurements of accuracy, recall, and F1 score for crowd tracking over Mall pedestrian dataset

Sequence no. (frame 70) | Ground truth | TP | FN | FP | Accuracy | Recall | F1 score
5  | 11 | 10 | 1 | 0 | 0.909 | 0.909 | 0.952
11 | 15 | 14 | 1 | 0 | 0.933 | 0.933 | 0.965
17 | 21 | 19 | 2 | 0 | 0.904 | 0.904 | 0.949
22 | 26 | 22 | 4 | 0 | 0.846 | 0.846 | 0.916
30 | 30 | 26 | 4 | 0 | 0.866 | 0.866 | 0.928
Mean accuracy: 89.16%
Fig. 22 Confusion matrix with mean accuracy for crowd anomaly detection on UMN dataset (mean accuracy of event detection = 93.5%)
Table 6 Measurements of precision, recall, and F1 score for crowd anomaly detection over UMN dataset

Events   | Precision | Recall | F1 score
Normal   | 0.951     | 0.98   | 0.964
Abnormal | 0.979     | 0.95   | 0.963
Average  | 0.965     | 0.965  | 0.963
$$F_1\,score = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (24)$$
where TP represents true positives, TN true negatives, FP false positives, and FN false negatives. The percentage of the total count of accurate classifications is referred to as accuracy; precision refers to the closeness of the measurements to each other; the percentage of real positives classified as anomalous is referred to as recall; and the F1 score is a measure of test accuracy. First, we evaluate the performance accuracy of semantic segmentation for multi-object detection; Figs. 20 and 21 show the confusion matrices for segmentation accuracy over the UCSD and UMN datasets.
Fig. 23 Confusion matrix with mean accuracy for crowd anomaly detection on MED dataset (mean accuracy of event detection = 92.5%)
Table 7 Measurements of precision, recall, and F1 score for crowd anomaly detection over MED dataset

Events   | Precision | Recall | F1 score
Normal   | 0.923     | 0.96   | 0.941
Abnormal | 0.958     | 0.92   | 0.938
Average  | 0.940     | 0.94   | 0.939
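For reference, the metrics of Eqs. (19)-(24) amount to only a few lines of code over per-frame counts and confusion-matrix tallies.

```python
import numpy as np

def mae_mse(pred_counts, gt_counts):
    e = np.asarray(pred_counts, float) - np.asarray(gt_counts, float)
    return np.mean(np.abs(e)), np.mean(e ** 2)   # Eq. (19), Eq. (20)

def classification_metrics(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + tn + fp + fn)        # Eq. (21)
    prec = tp / (tp + fp)                        # Eq. (22)
    rec = tp / (tp + fn)                         # Eq. (23)
    f1 = 2 * prec * rec / (prec + rec)           # Eq. (24)
    return acc, prec, rec, f1
```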
Table 8 Comparative analysis for crowd counting with other state-of-the-art methods in terms of mean absolute error (MAE) and mean square error (MSE) over UCSD and Mall datasets

Methods                       | UCSD MAE | UCSD MSE | Mall MAE | Mall MSE
MSHD [51]                     | 2.10     | 6.05     | 2.90     | 14.04
Geometric head detection [52] | 2.05     | 4.93     | 4.09     | 14.9
ST-CNN [53]                   | 4.03     | 5.87     | -        | -
DigCrowd [54]                 | -        | -        | 3.21     | 16.4
AMS-CNN (FV-3) [55]           | 2.46     | 3.32     | 2.94     | 3.63
FSAC-Meta learning [56]       | 3.08     | 4.16     | 2.44     | 3.12
DIR-CDCC [57]                 | 1.79     | 2.47     | 2.36     | 3.12
Proposed model                | 1.69     | 2.09     | 2.57     | 4.34
Table 9 Accuracy comparison of the proposed approach to state-of-the-art crowd tracking methods over UCSD dataset

Models                   | Average accuracy (%)
DDPMO [58]               | 81.3
MHT [59]                 | 65.3
DCEM [60]                | 77.9
TBC [61]                 | 82.1
PGM-based IEC model [62] | 86.9
Proposed model           | 91.8
4.2.1 Experiment 1: Crowd Counting and Tracking Over
UCSD and Mall Datasets
We evaluate the performance of our proposed crowd counting and tracking system on two publicly available benchmark datasets, UCSD and Mall. For evaluating the classification accuracy of the NFC, the experiments were repeated three times on the testing sets of each dataset independently. Table 3 shows the MAE and MSE of our proposed pedestrian crowd counting system over the UCSD and Mall datasets (Figs. 20, 21).
Tables 4 and 5 present the mean accuracy along with recall and F1 score of the crowd tracking system over the UCSD and Mall datasets for 30 sequences, where one sequence consists of 70 frames.
4.2.2 Experiment 2: Crowd Anomaly Detection Over UMN
and MED Datasets
In this experiment, we evaluate the performance of our crowd anomaly detection system using the confusion matrix, precision, recall, and F1 score over the UMN and MED benchmark datasets. Figure 22 depicts the confusion matrix and Table 6 shows the performance measurements for crowd anomaly detection over the UMN dataset for the first 30 sequences. Figure 23 shows the confusion matrix and Table 7 presents the performance measures for anomaly detection over the Motion Estimation Dataset (MED).
4.2.3 Experiment 3: Crowd Counting, Tracking and Anomaly Detection Comparison with State-of-the-art Methods
In this experiment, we compare our proposed system with other well-known methods. Table 8 shows a comparison of our proposed crowd counting system with other state-of-the-art methods.
In Table 9, a comparison between our crowd tracking system and other state-of-the-art systems shows that our system achieves a higher accuracy rate than existing crowd tracking methods.
Table 10 presents a comparison of our proposed crowd anomaly detection system with other state-of-the-art systems on the UMN dataset, with two outdoor scenes (scenes 1 and 3) and one indoor scene (scene 2). As depicted, our system achieved
Table 10 Accuracy comparison of the proposed approach to state-of-the-art crowd anomaly detection methods over UMN dataset

Methods                   | Scene 1 | Scene 2 | Scene 3 | Overall accuracy (%)
Cell Structure [63]       | -       | -       | -       | 88.3
LECM-2 [64]               | 98.5    | 87.9    | 94.6    | 93.01
CLMR (modified-GT) [65]   | 98.7    | 93.7    | 97.1    | 96.5
2STG-AKSVD2 [66]          | 89.80   | 76.42   | 89.50   | 85.20
MU-LSTM [67]              | 97      | 94      | 93      | 94.6
Tracklet Clustering [68]  | 97      | 89      | 98      | 94.3
MMAV [69]                 | -       | -       | -       | 90.0
PGM-based IEC Model [62]  | 87.43   | 83.21   | 90.63   | 86.06
Proposed model            | 99.23   | 95.56   | 97.89   | 96.59
a higher accuracy rate as compared to the other well-known
existing methods.
5 Conclusion
In this article, we introduced the idea of semantic segmentation for foreground extraction. We used clustering techniques and dynamic template matching for crowd counting and tracking. For anomaly detection, new robust spatio-temporal descriptors are extracted, optimized using AGA, and passed to a multilayer NFC. Through detailed experimentation, we demonstrated the capability of our proposed system in crowded environments. The accuracy of our tracking system degrades marginally for dense crowds, mainly because of the full occlusions that occur in the test videos. We evaluated the performance of our proposed model on four publicly available benchmark datasets and achieved superior accuracy rates compared to other existing state-of-the-art systems. The proposed system can be deployed to great benefit in various public places, such as political rallies, public celebrations, airports, train stations, and shopping malls, to control, protect, and supervise crowds.
In the future, we aim to work on more complex crowd environments and address the occlusion problem by introducing new occlusion reasoning methods. Furthermore, we plan to extend our work to the recognition of different scenes such as riots or chaotic acts, fights, sports, robbery, and road accidents.
Author Contributions Conceptualization, F.A. and A.J.; methodology, F.A. and A.J.; software, F.A.; validation, F.A. and A.J.; formal analysis, F.A. and A.J.; resources, A.J.; writing, review and editing, F.A. and A.J. All authors have read and agreed to the published version of the manuscript.
Data Availability Data sharing is not applicable.
Declarations
Conflict of interest The authors declare no conflict of interest.
References
1. Mo, H.; Wu, W.: Background noise filtering and distribution divid-
ing for crowd counting. IEEE Trans. Image Process. 29, 8199–8212
(2020)
2. Rezaei, K.; Mohini, M.K.: A survey on deep learning-based
real-time crowd anomaly detection for secure distributed video
surveillance. In: Personal and Ubiquitous Computing, pp. 1–17.
Springer (2021)
3. Balasundaram, A.; Chellappan, C.: An intelligent video analytics model for abnormal event detection in online surveillance video. J. Real-Time Image Process. 17, 915–930 (2020)
4. Jalal, A.; Batool, M.: Sustainable wearable system: human behav-
ior modeling for life-logging activities using K-ary tree hashing
classifier. Sustainability 12, 10324 (2020)
5. Wang, Q.; Yuan, Y.: Pixel-wise crowd understanding via synthetic
data. Int. J. Comput. Vis. 129, 225–245 (2021)
6. Nida, K.; Yazeed, G.; Jalal, A.: Semantic recognition of human-
object interactions via Gaussian-based elliptical modeling and
pixel-level labeling. IEEE Access 6, 66 (2021)
7. Tripathi, G.; Vishwakarma, D.K.: Convolutional neural networks
for crowd behavior analysis: a survey. Vis. Comput. 35, 753–776
(2019)
8. Gochoo, M.; Jalal, A.: Monitoring real-time personal locomotion
behaviors over smart indoor-outdoor environments via body-worn
sensors. IEEE Access 6, 66 (2021)
9. Lentzas, A.; Vrakas, D.: Non-intrusive human activity recognition and abnormal behavior detection on elderly people. Artif. Intell. Rev. 53, 1975–2021 (2020)
10. Nadeem, A.; Jalal, A.: Human actions tracking and recognition
based on body parts detection via artificial neural network. In: Pro-
ceedings of the ICACS, pp. 1–6. IEEE (2020)
11. Akhter, I.; Jalal, A.; Kim, K.: Adaptive pose estimation for gait
event detection using context-aware model and hierarchical opti-
mization. Proc. EE&T 16, 2721–2729 (2021)
12. Grant, J.M.; Flynn, P.J.: Crowd scene understanding from video: a
survey. ACM Trans. Multimedia Comput. Commun. Appl. TOMM
13(2), 1–23 (2017)
13. Gochoo, M.; Jalal, A.: Stochastic remote sensing event classifi-
cation over adaptive posture estimation via deep belief network.
Remote Sens. 13, 912 (2021)
14. Al-Shaery, A.M.; Khozium, M.O.: In-depth survey to detect, mon-
itor and manage crowd. IEEE Access 8, 209008–209019 (2020)
15. Mahmood, M.; Jalal, A: Robust spatio-temporal features for human
interaction recognition via artificial neural network. In: Proceed-
ings of the FIT, pp. 218–223. IEEE (2018)
16. Nadeem, A.; Jalal, A.: Automatic human posture estimation for
sports activity recognition with robust body parts detection. Mul-
timedia Tools Appl. 80, 21465–21498 (2021)
17. Xu, T.; Wang, W.: Crowd counting using accumulated HOG. In:
Proceedings of the ICNC-FSKD, pp. 1877–1881. IEEE (2016)
18. Jalal, A.; Mahmood, M.: Multi-features descriptors for human
activity tracking and recognition in Indoor-outdoor environments.
In: Proceedings of the IBCAST, pp. 371–376. IEEE (2019)
19. Wan, J.; Chan, A.: Adaptive density map generation for crowd
counting. In: Proceedings of the CVF, pp. 1130–1139. IEEE (2019)
20. Pervaiz, M.; Jalal, A.: Hybrid algorithm for multi people counting
and tracking for smart surveillance. In: Proceedings of the IBCAST,
pp. 530–535 (2021)
21. Zhang, X.; Zhang, L.: Real-time crowd counting with human detec-
tion and human tracking. In: Proceedings of the on NIP, pp. 1–8.
Springer. Cham (2014)
22. Khan, S.D.; Cheikh, F.A.: Disam: Density independent and scale
aware model for crowd counting and localization. In: Proceedings
of the ICIP, pp. 4474–4478. IEEE (2019)
23. Gochoo, M.; Jalal, A.: A systematic deep learning-based overhead
tracking and counting system using RGB-D remote cameras. Appl.
Sci. 11, 5503 (2021)
123
Arabian Journal for Science and Engineering
24. Pervaiz, M.; Jalal, A.: Smart surveillance system for people
counting and tracking using particle flow and modified SOM. Sus-
tainability 13, 5367 (2021)
25. Merad, D.; Drap, P.: Tracking multiple persons under partial and
global occlusions: application to customers’ behavior analysis. Pat-
tern Recogn. Lett. 81, 11–20 (2016)
26. Chahyati, D.; Arymurthy, A.M.: Multiple human tracking using
Retinanet features, Siamese neural network, and Hungarian algo-
rithm. In: Proceedings of the IAEME, vol. 10, pp. 465-475 (2020)
27. Pradeepa, B.; Vaidehi, V.: Anomaly detection in crowd scenes using
streak flow analysis. In: Proceedings of the WiSPNET, pp. 363–368
(2019)
28. Kim, K.; Jalal, A.: Vision-based human activity recognition system
using depth silhouettes. J. Electr. Eng. Technol. 14, 2567–2573
(2019)
29. Zhang, X.; Stevens, B.: Scene perception guided crowd anomaly
detection. Neurocomputing 414, 291–302 (2020)
30. Khan, M.U.K.; Kyung, C.M.: Rejecting motion outliers for efficient
crowd anomaly detection. IEEE Trans. IFS 14, 541–556 (2018)
31. Shehzad, A.; Jalal, A.; Kim, K.: Multi-person tracking in smart
surveillance system for crowd counting and normal/abnormal
events detection. In: Proceedings of the ICAEM, pp. 163–168.
IEEE (2019)
32. Yimin, D.O.U.; Wei, C.: Abnormal behavior detection based on
optical flow trajectory of human joint points. In: Proceedings of
the CCDC, pp. 653–658. IEEE (2019)
33. Nawaratne, R.; Yu, X.: Spatiotemporal anomaly detection using
deep learning for real-time video surveillance. IEEE Trans. Ind.
Inf. 16, 393–402 (2019)
34. Chen, T.; Chen, H.: Anomaly detection in crowded scenes using
motion energy model. Multimedia Tools Appl. 77, 14137–14152
(2018)
35. Khalid, N.; Jalal, A.; Kim, K.: Modeling two-person segmentation
and locomotion for stereoscopic action identification. Sustainabil-
ity 13, 970 (2021)
36. Minaee, S.; Terzopoulos, D.: Image segmentation using deep learn-
ing: a survey. In: Proceedings of the TPAMI. IEEE (2021)
37. Jalal, A.; Ahmed, A.; Kim, K.: Scene semantic recognition based
on modified fuzzy C-mean and maximum entropy using object-to-
object relations. IEEE Access 9, 27758–27772 (2021)
38. Rafique, A.A.; Jalal, A.: Statistical multi-objects segmentation for
indoor/outdoor scene detection and classification via depth images.
In: Proceedings of the IEEE, pp. 271–276. IBCAST (2020)
39. Sahbani, B.; Adiprawita, W.: Kalman filter and iterative-hungarian
algorithm implementation for low complexity point tracking as
part of fast multiple object tracking system. In: Proceedings of the
ICSET, pp. 109–115. IEEE (2016)
40. Jalal, A.; Sarif, N.; Kim, T.S.: Human activity recognition via
recognized body parts of human depth silhouettes. Indoor Built
Environ. 22, 271–279 (2013)
41. Alzahrani, A.J.; Ullah, H.: Anomaly detection in crowds by fusion
of novel feature descriptors. J. Eng. Manag. Technol. 11, 11A16B-1
(2020)
42. Jalal, A.; Kim, Y.; Kim, D.: Ridge body parts features for human
pose estimation and recognition from RGB-D video data. In: Pro-
ceedings of the ICCCNT, pp. 1–6. IEEE (2014)
43. Zhu, H.D.; Zhong, Y.: Feature selection method by applying par-
allel collaborative evolutionary genetic algorithm. J. Electr. Sci.
Technol. 8, 108–113 (2010)
44. Jalal, A.; Lee, S.; Kim, T.S.: Human activity recognition via the
features of labeled depth body parts. In: Proceedings of the Smart
Homes and Health Telematics, pp. 246–249. Springer (2012)
45. Chan, A.B.; Vasconcelos, N.: Modeling, clustering, and segment-
ing video with mixtures of dynamic textures. In: Proceedings of
the IEEE TPAMI, vol. 30, pp. 909–926 (2008)
46. Chen, K.; Xiang, T.: Feature mining for localised crowd counting.
In Bmvc. 1, 3 (2012)
47. Mehran, R.; Shah, M.: Abnormal crowd behavior detection using
social force model. In: Proceedings of the CVPR, pp. 935–942
(2009)
48. Rabiee, H.; Murino, V.: Novel dataset for fine-grained abnormal
behavior understanding in-crowd. In: Proceedings of the AVSS,
pp. 95–101. IEEE (2016)
49. Chriki, A.; Kamoun, F.: Deep learning and handcrafted features
for one-class anomaly detection in UAV video. Multimedia Tools
Appl. 80, 2599–2620 (2021)
50. Abdulhussain, S.H.; Mahmmod, B.M.; Saripan, M.I.; Al-Haddad,
S.A.R.; Baker, Flayyih, W.N.; Jassim, W.A.: A fast feature extrac-
tion algorithm for image and video processing. In: Proceedings of
the IJCNN, pp. 1–8 (2019)
51. Ma, T.; Li, N.: Scene invariant crowd counting using multi-
scales head detection in video surveillance. IET Image Proc. 12,
2258–2263 (2018)
52. Miao, Y.; Zhang, B.: ST-CNN: spatial–temporal convolutional neu-
ral network for crowd counting in videos. Pattern Recogn. Lett. 125,
113–118 (2019)
53. Xu, M.; Xu, C.: Depth information guided crowd counting for com-
plex crowd scenes. Pattern Recogn. Lett. 125, 563–569 (2019)
54. Saqib, M.; Blumenstein, M.: Crowd counting in low-resolution
crowded scenes using region-based deep convolutional neural net-
works. IEEE Access 7, 35317–35329 (2019)
55. Pandey, A.; Trivedi, A.: KUMBH MELA: a case study for
dense crowd counting and modeling. Multimedia Tools Appl. 79,
17837–17858 (2020)
56. Reddy,M.K.K.; Wang, Y.: Few-shot scene adaptive crowdcounting
using meta-learning. In: Proceedings of the CVF, pp. 2814–2823.
IEEE (2020)
57. He, Y.; Gong, Y.: Error-aware density isomorphism reconstruction
for unsupervised cross-domain crowd counting. In: Proceedings of
the AAAI (2021)
58. Neiswanger, W.; Xing, E.: The dependent Dirichlet process mix-
ture of objects for detection-free tracking and object modeling. In:
Proceedings of the Artificial Intelligence Statistics, pp. 660–668
(2014)
59. Kim, C.; Rehg, J.M.: Multiple hypothesis tracking revisited. In:
Proceedings of the CV, pp. 4696–4704. IEEE (2015)
60. Milan, A.; Roth, S.: Multi-target tracking by discrete-continuous
energy minimization. IEEE TPAMI 38, 2054–2068 (2016)
61. Ren, W.; Chan, A.B.: Tracking-by-counting: using network flows
on crowd density maps for tracking multiple targets. IEEE Trans.
Image Proc. 30, 1439–1452 (2020)
62. Abdullah, F.; Gochoo, M.; Jalal, A.: Multi-person tracking and
crowd behavior detection via particles gradient motion descriptor
and improved entropy classifier. Entropy 23, 628 (2021)
63. Leyva, R.; Li, C.T.: Video anomaly detection with compact fea-
ture sets for online performance. IEEE Trans. Image Process. 26,
3463–3478 (2017)
64. Sezer, E.S.; Can, A.B.: Anomaly detection in crowded scenes using
log-Euclidean covariance matrix. In: Proceedings of the VISI-
GRAPP, pp. 279–286 (2018)
65. Patil, N.; Biswas, P.K.: Global abnormal events detection in
crowded scenes using context location and motion-rich spatio-
temporal volumes. IET Image Proc. 12, 596–604 (2018)
123
Arabian Journal for Science and Engineering
66. Ege, C.Ö.: Two-Stage Sparse Representation Based Abnormal
Crowd Event Detection in Videos. University of Helsinki (2020)
67. Moustafa, A.N.; Gomaa, W.: Gate and common pathway detec-
tion in crowd scenes and anomaly detection using motion
units and LSTM predictive models. Multimedia Tools Appl. 79,
20689–20728 (2020)
68. Hassanein, A.S.; Yagi, Y.: Identifying motion pathways in highly
crowded scenes: a non-parametric tracklet clustering approach.
Comput. Vis. Image Underst. 191, 102710 (2020)
69. Rehman, A.U.; Mahmood, T.; Khan, H.O.A.: Multi-modal
anomaly detection by using audio and visual cues. IEEE Access 9,
30587–30603 (2021)
Springer Nature or its licensor holds exclusive rights to this article
under a publishing agreement with the author(s) or other rightsholder(s);
author self-archiving of the accepted manuscript version of this article
is solely governed by the terms of such publishing agreement and appli-
cable law.