Arabian Journal for Science and Engineering
https://doi.org/10.1007/s13369-022-07092-x
RESEARCH ARTICLE-COMPUTER ENGINEERING AND COMPUTER SCIENCE
Semantic Segmentation Based Crowd Tracking and Anomaly
Detection via Neuro-fuzzy Classifier in Smart Surveillance System
Faisal Abdullah1
·Ahmad Jalal1
Received: 25 October 2021 / Accepted: 22 June 2022
© King Fahd University of Petroleum & Minerals 2022
Abstract
Crowd tracking and analysis of crowd behavior is a challenging research area in computer vision. In today's crowded environments, manual surveillance systems are inefficient, labor-intensive, and unwieldy. Automated video surveillance systems offer promising solutions to these problems and have therefore become a necessity. However, challenges remain: the most important is the extraction of a foreground representing human pixels only, and accurate behavior detection further requires robust spatial and temporal descriptors together with a potent classifier. In this paper, we address these challenges by introducing semantic segmentation for foreground extraction. For pedestrian counting and tracking, we introduce a fusion of human motion analysis and an attraction force model via a weighted averaging method that removes non-humans and non-pedestrians from the scene. The verified pedestrians are counted using the fuzzy-c-means algorithm and tracked via Hungarian-algorithm association combined with a dynamic template matching technique. For anomaly detection, after silhouette extraction we introduce robust spatio-temporal descriptors, including crowd shape deformation, silhouette slicing, particle convection, dominant motion, and energy descriptors, which we optimize using an adaptive genetic algorithm; the multi-fused optimal features are finally fed to a multilayer neuro-fuzzy classifier for decision making. The proposed system is validated via extensive experimentation and achieves accuracies of 91.8% and 89.16% for crowd tracking on the UCSD and Mall datasets, respectively. The mean absolute error and mean squared error for pedestrian counting are 1.69 and 2.09 on the UCSD dataset and 2.57 and 4.34 on the Mall dataset, respectively. Accuracies of 96.5% and 94% are achieved on the UMN and MED datasets for anomaly detection.
Keywords Attraction force model ·Crowd shape deformation ·Multilayer neuro-fuzzy classifier ·Semantic segmentation ·
Time-domain descriptors ·Tracking and anomaly detection
1 Introduction
Automatic video surveillance is regarded as the first step in various artificial intelligence applications [1, 2] developed to track human crowds and analyze crowd behavior [3, 4]. Automated surveillance systems rapidly detect unusual and critical situations in crowded environments, which assists in making adequate decisions for safety and emergency control [5, 6]. Hence, surveillance systems are essential in complex and crowded environments
Ahmad Jalal (corresponding author)
ahmjal@yahoo.com
Faisal Abdullah
191633@students.au.edu.pk
1 Department of Computer Science, Air University, E-9, Islamabad 44000, Pakistan
like busy streets, political rallies, airports, train stations, and shopping malls, to automatically detect and control escape or panic behavior caused by riots, chaotic acts, stampedes, pushing, and violent events, for public safety, security, and statistical purposes [7, 8].
To supervise, protect, and control a crowd, density estimation and tracking are crucial video-frame analysis processes, as they provide basic descriptions of crowd status [9, 10]. However, counting and tracking in crowded scenes are challenging because of instantaneous illumination changes, varied outlooks and behaviors, partial or full occlusions, complicated backgrounds, indoor and outdoor scenes, and the fact that as the crowd grows the number of pixels per human decreases [11, 12]. On the other hand, challenges faced in crowd behavior detection include low resolution with dynamic backgrounds, modeling of crowd behavior, occlusion between individuals, and random variations of a crowd
[13, 14]. Hence, accurate crowd behavior detection across diverse video scenes requires the extraction of robust descriptors that provide significant information about motion and scene changes, together with a strong decision-making classifier [15, 16].
In this research article, we propose a new robust approach for pedestrian crowd density estimation, tracking, and anomaly detection. We begin with pre-processing steps. Then, for foreground extraction, we introduce semantic segmentation by labeling and clustering pixels belonging to the same class. After foreground extraction, our work involves two facets: (i) pedestrian crowd counting and tracking, and (ii) crowd anomaly detection. For pedestrian counting and tracking, we first verify the extracted silhouettes by introducing a weighted averaging fusion of HMA and AFM, excluding non-humans and non-pedestrians from the scene. We then count the pedestrians with the fuzzy-c-means algorithm, and use the Hungarian algorithm for association and dynamic template matching for tracking. For anomaly detection, we first extract new robust spatio-temporal descriptors, including crowd shape deformation, dominant motion, silhouette slicing, energy, and particle convection descriptors, which we optimize using an adaptive genetic algorithm. Lastly, the optimized multi-fused distinguishable descriptors are passed through a multilayer neuro-fuzzy classifier for anomaly detection.
The major contributions and highlights presented in this
paper are summarized as follows.
1. We proposed a robust semantic segmentation approach
for foreground extraction, which is a necessary step
for crowd estimation, tracking, and behavior analysis in
crowded scenes.
2. A fusion of AFM and HMA via a weighted averaging process is introduced for the removal of non-humans and non-pedestrians from the scene.
3. A clustering method using the fuzzy-c-means algorithm is used for density estimation, and a particle-based measure is introduced for inferring the number of pedestrians in each cluster.
4. Multi-scale descriptors are introduced, namely crowd shape deformation, dominant motion, energy-based, silhouette slicing, and particle convection descriptors, for anomaly detection.
5. A multilayer neuro-fuzzy classifier is used to make the decision based on multi-fused optimized descriptors for anomaly detection. A comparative analysis is carried out on four publicly available benchmark datasets: UCSD and Mall for crowd counting and tracking, and UMN and MED for anomaly detection.
The rest of the article is arranged as follows: Sect. 2 succinctly reviews other state-of-the-art methods. Section 3 describes the methodology of our proposed system. Performance evaluation of our proposed approach on four benchmark datasets, together with a comparative analysis and discussion, is given in Sect. 4. Finally, in Sect. 5, we conclude the paper and outline future directions.
2 Related Work
In recent years, different computer vision approaches have
been proposed by researchers for crowd density estimation,
tracking, and anomaly detection [17,18]. We divide the
related work into two subsections, the first section describes
crowd density estimation and tracking systems; however, the
second subsection describes crowd anomaly detection sys-
tems.
2.1 Crowd Density Estimation and Tracking
Various researchers have employed different models to track
and estimate crowd density [19, 20]. Table 1 presents a summary of research works relevant to these models.
2.2 Crowd Anomaly Detection Systems
Numerous researchers have devoted their energies in devel-
oping systems for anomaly detection using different methods
[27, 28]. Table 2 shows a detailed summary of research works relevant to these models.
3 Proposed System Framework
In this paper, we introduce a robust semantic-segmentation-based pedestrian tracking and anomaly detection system. In our proposed system, we initially apply pre-processing steps, then deploy semantic segmentation for multiple-object detection and extract human-resembling silhouettes. After that, we segregate our work into two facets. In the first, for crowd counting and tracking, we verify the extracted silhouettes by introducing a weighted average fusion of the attraction force model and human motion analysis. Next, crowd counting is performed using the fuzzy-c-means algorithm, and for crowd tracking we use Hungarian-algorithm association and a dynamic template matching technique. In the second facet, for anomaly detection, after obtaining human silhouettes we extract spatio-temporal descriptors that are optimized using an adaptive genetic algorithm. These optimized features are then fed to a multilayer neuro-fuzzy classifier for anomaly detection. Figure 1 depicts the synoptic schematics of our proposed system. The details of each of the aforementioned modules are explained in the following subsections.
Table 1 Crowd density estimation and tracking systems
Authors Methodology Highlights and limitations
Zhang et al. [21] For the detection of humans, the fusion of cascade boosted
classifier and rectangle features were used for training the
multi-scale head-shoulder, and then human tracking is
used to eliminate the duplicates and count the pedestrians
The system has certain misclassifications due to the
similar randomization of different classes
Khan et al. [22] The human head detection and localization method were
used for crowd counting. Scale map-based scale-aware
head proposals were generated that passed to the CNN for
head probabilities, which are then confirmed and added to
the count using non-maximal suppression
The author used non-maximal suppression with a fixed
threshold to get the precise location of heads, which
also affects the performance of localization when
threshold value changes, which limits the system
performance
Gochoo et al. [23] Developed Hough circular gradient transforms for head
detection and HOG-based symmetry technique for
shoulder detection, the detected heads are verified by 1D
CNN, which are then counted using cross-line judgment
technique
The system accuracy is limited with illumination changes
and in complex crowded scenes, especially in the queue
condition, the system produced some misdetection of
heads due to overlaps
Pervaiz et al. [24] The template matching method was used for human
verification. Then people counting was performed by
distributing multiple particles on humans for extraction of
particle flows that are clustered by the self-organizing map
The model accuracy decreases in the dense crowd as the
model was not able to detect human silhouettes that are
partially or fully occluded by other objects for an
extended period of time
Merad et al. [25] This research work is based on the association of two
modules, the first module is the tracking module and the
second module is the association module to recover the
global trajectories of tracked individuals in a multiple
target tracking system
The system was not effective for arbitrary movements
and overlaps, also they used the k-nearest neighbor
algorithm for re-identification strategy and the accuracy
varies as the number of k (chosen neighbors) varies
Chahyati et al. [26] The multiple target objects are detected using RetinaNet
and then the Hungarian algorithm is used for tracking
The detection accuracy of the system degrades in a
complex crowd, which limits the system accuracy
3.1 Pre-Processing
During pre-processing, videos from a static camera are first converted into color frames [f1, f2, f3, ..., fN], where N is the total number of frames. Each color frame is then passed through an Adaptive Median Filter (AMF) to effectively remove noise and distortion and to provide smoothing while preserving edges. AMF works in two stages: it compares each pixel in the image to its neighboring pixels and classifies pixels as noise via spatial processing. In AMF, a pixel is labeled as impulse noise if it is not structurally aligned with the pixels to which it is similar and differs from a majority of its neighbors. Both the comparison threshold and the neighborhood size are adjustable. Noisy pixels are replaced by the median value of the neighborhood pixels that have passed the noise-labeling test. After AMF, histogram equalization is performed on the filtered image to adjust its contrast using Eq. (1).
$$s_k = T(r_k) = (L-1)\sum_{j=0}^{k} p_r(r_j) \qquad (1)$$
where $k = 0, 1, 2, \ldots, L-1$ and the variable $r$ denotes the intensities of the input image to be processed. As usual, we assume that $r$ lies in the range $[0, L-1]$, with $r = 0$ representing black and $r = L-1$ representing white, while $s$ represents the output intensity level after intensity mapping for every pixel in the input image having intensity $r$. Here, $p_r(r)$ is the probability density function (PDF) of $r$, where the subscript on $p$ indicates that it is the PDF of $r$. Thus, a processed (output) image is obtained using Eq. (1) by mapping each pixel with intensity $r_k$ in the input image into a corresponding pixel with level $s_k$ in the output image, as shown in Fig. 2.
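The pre-processing chain above (AMF followed by the histogram equalization of Eq. (1)) can be sketched in a few lines of NumPy. This is a minimal illustration under the standard adaptive-median formulation, not the authors' implementation:

```python
import numpy as np

def adaptive_median_filter(img, max_win=7):
    """Two-stage AMF sketch. For every pixel the window grows until the
    local median is not an impulse (stage A); the pixel is replaced by
    that median only if the pixel itself is classified as impulse noise
    (stage B)."""
    pad = max_win // 2
    padded = np.pad(img, pad, mode="edge")
    out = img.copy()
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            for win in range(3, max_win + 1, 2):
                r = win // 2
                block = padded[y + pad - r:y + pad + r + 1,
                               x + pad - r:x + pad + r + 1]
                zmin, zmed, zmax = block.min(), np.median(block), block.max()
                if zmin < zmed < zmax:                   # stage A: median is not impulse
                    if not (zmin < img[y, x] < zmax):    # stage B: pixel is impulse
                        out[y, x] = zmed
                    break
            else:
                # window reached max size: fall back to the last median
                out[y, x] = zmed
    return out

def hist_equalize(img, L=256):
    """Eq. (1): s_k = (L-1) * sum_{j<=k} p_r(r_j), applied per pixel."""
    hist = np.bincount(img.ravel(), minlength=L)
    pdf = hist / img.size
    cdf = np.cumsum(pdf)
    return np.round((L - 1) * cdf[img]).astype(np.uint8)
```

A lone impulse in a flat region is replaced by the neighborhood median, while the equalized output spreads the cumulative distribution over the full intensity range.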
3.2 Semantic Segmentation
After the pre-processing phase, we deploy semantic segmentation for multi-object detection [35–38]. Since an image is a collection of pixels, in semantic segmentation (SS) we classify every pixel of an image into a particular label class, resulting in an image segmented by class. SS is used to recognize collections of pixels that form distinct categories. We apply a deep learning algorithm for semantic segmentation using an encoder-decoder structure. Our network consists of an encoder module that gradually reduces the feature maps and captures higher-level semantic information, and a decoder module that refines the segmentation results along object boundaries. Figure 3 shows our encoder-decoder structure with atrous convolution for
Table 2 Crowd anomaly detection system
Authors Methodology Highlights and limitations
Zhang et al. [29] Anomalies are detected via fluid force model and scene
perception. Fluid features and appearance features are
extracted, and the final decision was taken by a one-class SVM on the basis of the extracted features
The proposed model was not effective in different
scenes, hence adaptability of the method in different
cases are needed to be further improved
Karim et al. [30] Anomaly is detected on the basis of outlier rejection.
Superpixels whose direction of motion doesn’t conform
with the dominant motion are rejected, and extracted
features with a modified k-means classification
algorithm were used for anomaly detection
The proposed hand-crafted features produced
false-positive results in complex crowds. However, if
the hand-crafted features will be fused with a deep
neural network then further performance improvement
is expected
Shehzad et al. [31] This research work used the Jaccard similarity index and
template matching technique for multi-people tracking.
However, gaussian clusters were introduced for
abnormal event detection
The system detects humans via thresholding technique.
The use of thresholding in the detection module
degrades the system accuracy in occlusions and
illumination variation conditions
Yimin et al. [32] This research work detects anomalies based on the optical-flow trajectories of joint points for every human body; features are extracted using trajectory constraints, and the final decision is taken by SVM
The system accuracy will be seriously reduced in a
large-scale crowd with inevitable overlaps and
occlusions as it is difficult to obtain an accurate
trajectory in a complex crowd
Nawaratne et al. [33] Developed an incremental Spatio-temporal learner model
by utilizing active learning, that temporally updates on
anomalies using convolution layer that learn spatial
regularities and ConvLSTM layers that learn temporal
regularities
The system may produce some false negative detections
in case of re-occurring anomalies. Also, the system
required a large dataset and time for training
Chen et al. [34] A motion energy model was introduced that detects
anomalies by considering the sum of square differences
metric of motion information in the center and its
neighboring blocks based on the preset threshold
The use of fixed block size for calculating the motion
energy value limits the system accuracy, also the
anomaly is detected if the motion energy value is
beyond the preset threshold which is not effective in all
cases
semantic segmentation. We use the atrous spatial pyramid pooling module as an encoder, applying atrous convolution with different rates to probe convolutional features at multiple scales. Atrous convolution allows us to extract features computed by a DCNN at an arbitrary resolution and to adjust the filter field-of-view so as to capture multi-scale contextual information. At the decoder end, we first upsample the features bilinearly by a factor of 4 and then concatenate the low-level features. Furthermore, for smooth training and to increase the importance of encoder features, we reduce the number of channels by applying a 1 × 1 convolution on the low-level features. Finally, we again upsample bilinearly by a factor of 4 after refining the features with a 3 × 3 convolution. Figure 4 shows results of semantic segmentation for different random views.
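The effect of the atrous rate is easiest to see in one dimension. The sketch below is our own illustration (not part of the paper's network): with rate > 1 the kernel taps are spread out, enlarging the field-of-view without adding parameters, which is the idea behind the ASPP encoder.

```python
import numpy as np

def atrous_conv1d(x, kernel, rate):
    """1-D atrous (dilated) cross-correlation: y[i] = sum_k w[k] * x[i + rate*k].

    The effective field-of-view grows from len(kernel) samples at rate=1
    to rate*(len(kernel)-1)+1 samples at larger rates, with the same
    number of weights."""
    k = len(kernel)
    span = rate * (k - 1) + 1          # effective receptive field
    n = len(x) - span + 1              # number of valid output positions
    return np.array([sum(kernel[j] * x[i + rate * j] for j in range(k))
                     for i in range(n)])
```

For a ramp input and a difference kernel, a rate of 2 computes differences over a span of 4 samples instead of 2, showing the enlarged field-of-view.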
3.3 Silhouettes Extraction
After multi-object detection through semantic segmentation,
we extract only those pixels that belong to the human class
as every pixel is labeled and assigned to its particular class in
SS. All the pixels other than the human class are set to zero, as
we are interested in human silhouettes only. After extracting
the pixels of the human class, we convert the image into a
binary image using Eq. (2).
$$bw(x,y) = \begin{cases} 1 & \text{if } I(x,y) > 0 \\ 0 & \text{if } I(x,y) \le 0 \end{cases} \qquad (2)$$

where $I$ is the image with only human-class pixels and $bw$ is the resultant binary image containing only human silhouettes, as shown in Fig. 5b.
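Eq. (2) amounts to a two-line mask operation. A minimal sketch, where the integer label id for the human class is a hypothetical value (the real id depends on the segmentation network's label map):

```python
import numpy as np

HUMAN_CLASS = 15  # hypothetical "person" label id; assumption, not from the paper

def human_binary_mask(label_map):
    """Eq. (2): zero every pixel not labelled as human, then binarize,
    yielding a silhouette image with 1 for human pixels and 0 elsewhere."""
    human_only = np.where(label_map == HUMAN_CLASS, label_map, 0)
    return (human_only > 0).astype(np.uint8)
```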
3.4 Crowd Density Estimation and Tracking
An authentic pedestrian crowd tracking system requires extraction of the true foreground, representing human pedestrians only. Hence, for scrupulous pedestrian tracking, after semantic segmentation we perform a human silhouette verification step, followed by the pedestrian counting and tracking steps.
3.4.1 Pedestrian Human Silhouettes Verification
For human silhouettes verification, we introduced a robust
Human Motion Analysis (HMA) and Attraction Force Model
[Figure 1 block diagram: input frames → pre-processing → object detection → silhouette extraction, branching into (i) silhouettes verification → pedestrian counting → pedestrian tracking → tracking frames, and (ii) descriptors extraction (crowd shape deformation, particles convection, energy, dominant motion, silhouettes slicing) → AGA optimization → anomaly detection → anomaly frames]
Fig. 1 Synoptic schematics of the proposed pedestrian tracking and crowd anomaly detection system
Fig. 2 Pre-processing steps. (a) Filtered image using AMF, (b) histogram of filtered image, (c) histogram of the enhanced image, and (d) enhanced image
(AFM). We eliminate all the objects other than human pedes-
trians, using a fusion of HMA and AFM via the weighted
averaging method for accurate and strict pedestrian tracking.
[Figure 3 diagram: encoder — DCNN with atrous convolution feeding parallel 1×1 conv, three 3×3 convs, and image pooling, concatenated and reduced by a 1×1 conv; decoder — low-level features through a 1×1 conv, concatenated with encoder output upsampled by 4, refined by a 3×3 conv, and upsampled by 4 again to the prediction]
Fig. 3 Encoder–decoder structure with Atrous convolution for semantic
segmentation
In HMA, to distinguish pedestrians from other objects, e.g., bicyclists and bikes, we determine the internal motion of a moving silhouette over time using a star-skeleton strategy. To produce the star skeleton on extracted silhouettes, we first find the centroid and then connect the center of each silhouette by a line to three extremal points recovered by traversing the boundary. The three extremal
Fig. 4 Results of semantic segmentation: (a), (b) over the UCSD pedestrian dataset, and (c) over the UMN dataset
Fig. 5 Foreground extraction. (a) Human-class pixels only, (b) binary view of the extracted human class
Fig. 6 Human motion analysis: determining angles from a skeleton, where x_c and y_c are the spatial locations of the center of the skeleton, corresponding to the hip position in the vertical and horizontal directions, respectively
points usually represent the torso and two legs, taking the uppermost and lowermost extremal points, as human motion is normally in an upright position. Pedestrians exhibit periodic motion while moving, whereas other objects, e.g., bicyclists, exhibit rotational motion. Hence, to analyze the motion we measure the angle α between the vertical and the uppermost extremal point, the angle β between the two lowermost extremal points, and the angle γ between the end locations of the two extremal points, measuring the variation of movement over time in 2D space corresponding to the ankles, as depicted in Fig. 6. We thus distinguish pedestrians from other objects by analyzing human motion via the variation of these three angles over time, as depicted in Fig. 7.
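The three angles can be computed from the centroid and extremal points with basic trigonometry. The sketch below is our illustration; the point roles (torso top, two leg ends) are assumed from the text:

```python
import math

def star_skeleton_angles(centroid, top, leg_l, leg_r):
    """Sketch of the three HMA angles (roles assumed from the text):
    alpha -- angle at the centroid between the vertical axis and the
             uppermost extremal point (torso);
    beta  -- angle at the centroid between the two lowermost extremal
             points (legs);
    gamma -- angle at the top point subtended by the two leg end points.
    Points are (x, y) in image coordinates (y grows downward)."""
    def ang(v1, v2):
        d = math.hypot(*v1) * math.hypot(*v2)
        c = (v1[0] * v2[0] + v1[1] * v2[1]) / d
        return math.degrees(math.acos(max(-1.0, min(1.0, c))))
    cx, cy = centroid
    alpha = ang((top[0] - cx, top[1] - cy), (0.0, -1.0))  # vertical points up
    beta = ang((leg_l[0] - cx, leg_l[1] - cy), (leg_r[0] - cx, leg_r[1] - cy))
    gamma = ang((leg_l[0] - top[0], leg_l[1] - top[1]),
                (leg_r[0] - top[0], leg_r[1] - top[1]))
    return alpha, beta, gamma
```

Tracking these three values over consecutive frames distinguishes the periodic swing of pedestrian legs from the rotational motion of, say, bicycle wheels.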
Fig. 7 Human motion analysis. (a) Star skeleton projected onto the silhouettes over an extracted frame, (b) magnified view of motion analysis
Fig. 8 Attraction force model. (a) Particle conversion of all extracted silhouettes, (b) magnified view of particle conversion
Fig. 9 Attraction force model. (a) Attraction force between two particles, (b) magnified view of AF for a pedestrian, (c) magnified view of AF for a non-pedestrian
In AFM, we first convert each extracted silhouette into particles, such that every silhouette is represented by a collection of particles R = [p1, p2, p3, ..., pZ], where Z is the total number of particles in one silhouette, as shown in Fig. 8. From physics, we know that in solids the particles, having little kinetic energy, cannot overcome the strong force of attraction (bonds) that pulls the particles toward each other. Using this same concept, we treat every extracted pixel as a fluid particle and calculate the force of attraction between particles of each extracted silhouette. To reduce the computational complexity, we calculate the internal force of attraction between two mutually interacting particles using Eq. (3), as shown in Fig. 9.
$$F_i = \frac{p_1\, p_2}{r^2} \qquad (3)$$

where $F_i$ is the attraction force between $p_1$ and $p_2$ of the $i$-th silhouette, $i$ lies in the range $[1, E]$ with $E$ the maximum number of silhouettes per frame, and $r^2$ is the squared distance between $p_1$ and $p_2$.
After calculating the attraction force, we discard silhouettes having a static attraction force over a sequence of frames using Eq. (4):

$$H_s = \begin{cases} 1 & \text{if } \dfrac{dF_i}{dt} > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

where $dF_i/dt$ is the change, from frame $t$ to $t+1$, in the force of attraction among the particles of each $i$-th silhouette. We also eliminate objects whose attraction force is beyond a certain threshold, considering them non-pedestrians, i.e., bicyclists, bikes, etc.
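Eqs. (3) and (4) can be sketched as follows. The particle "mass" term is our assumption (unit mass for pixel particles), since the paper does not fix what $p_1 p_2$ multiplies for image particles:

```python
import numpy as np

def attraction_force(p1, p2):
    """Eq. (3): F = (p1 * p2) / r^2 for two mutually interacting particles.
    Each particle is (mass, x, y); unit mass is an assumption here."""
    m1, x1, y1 = p1
    m2, x2, y2 = p2
    r2 = (x1 - x2) ** 2 + (y1 - y2) ** 2
    return m1 * m2 / r2

def keep_dynamic(forces_t, forces_t1, eps=1e-9):
    """Eq. (4): H_s = 1 for silhouettes whose attraction force increases
    (dF_i/dt > 0) between frame t and t+1; static silhouettes get 0 and
    are discarded."""
    f0 = np.asarray(forces_t, dtype=float)
    f1 = np.asarray(forces_t1, dtype=float)
    return ((f1 - f0) > eps).astype(int)
```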
In summary, the process in this section verifies the silhouettes obtained through SS using a fusion of AFM and HMA via a weighted averaging process, and eliminates non-humans and non-pedestrians so as to distinguish pedestrians from other objects for accurate counting and tracking in crowded scenes.
3.4.2 People Counting
After human pedestrian verification, we perform cluster estimation to count the detected human silhouettes using the fuzzy-c-means algorithm. As each silhouette consists of a collection of particles forming a cluster, we first label these clusters in each frame using Eq. (5).
$$L_c = I_m(p_k) \qquad (5)$$
where $I_m$ is the label of cluster $m$ and $p_k$ is the total count of particles in one cluster, while $L_c$ is the resulting extracted labeled cluster, which is considered as one silhouette and mediated in counting. We also draw bounding boxes around every cluster to make them visually apparent; hence, using labeling and cluster estimation, we count all verified human pedestrians, as depicted in Fig. 10. Note that the number of clusters varies from frame to frame, and the number of particles in each cluster varies from cluster to cluster.
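The fuzzy-c-means step can be sketched with the standard membership and center updates. This minimal NumPy version is our illustration, not the authors' implementation:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, seed=0):
    """Minimal fuzzy-c-means sketch for grouping silhouette particles.
    X: (n, d) particle coordinates; c: cluster count; m: fuzzifier (>1).
    Returns cluster centers (c, d) and fuzzy memberships u (c, n)."""
    rng = np.random.default_rng(seed)
    u = rng.random((c, len(X)))
    u /= u.sum(axis=0)                         # columns of u sum to 1
    for _ in range(iters):
        um = u ** m
        centers = um @ X / um.sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
        # standard update: u_ik proportional to d_ik^(-2/(m-1))
        u = 1.0 / (d ** (2 / (m - 1)) * np.sum(d ** (-2 / (m - 1)), axis=0))
    return centers, u
```

Hard cluster assignments (and hence the per-frame count) follow from the column-wise argmax of the membership matrix.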
Inferring the Number of People in Each Cluster The optimal result of the clustering process above depicts each pedestrian in the scene with one distinct cluster, hence the count of pedestrians equals the count of clusters. In reality, this is not always the case, because occlusion can cause pedestrians close to each other to be clustered together; the cluster count by itself can therefore be misleading. Thus, we propose a particle-based measure,
Fig. 10 Pedestrian crowd counting results over different time intervals
Fig. 11 Pedestrian crowd counting results, inferring the number of pedestrians in each cluster, on (a) the Mall dataset and (b) the UCSD dataset
since we have already converted each extracted silhouette into particles. In practice, a single pedestrian usually consists of a specific number of particles. We first measure the minimum number of particles required for a single pedestrian and then use Eq. (6) to infer the number of pedestrians in each cluster:

$$H_k = \frac{p_k}{\acute{p}_{n \in k}} \qquad (6)$$

where $H_k$ represents the total number of pedestrians in cluster $k$, $p_k$ is the total number of particles in cluster $k$, and $\acute{p}_{n \in k}$ is the average number of particles required for a single pedestrian. Figure 11 shows the pedestrian counts when there is occlusion and two humans are present in one cluster.
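Eq. (6) reduces to a single ratio. Rounding that ratio to a whole number of pedestrians (with a floor of one) is our assumption, as the paper does not state how the ratio is discretised:

```python
def people_in_cluster(particle_count, avg_particles_per_person):
    """Eq. (6): H_k = p_k / p-bar, the cluster's particle count divided by
    the average particles a single pedestrian occupies. Rounding to the
    nearest whole pedestrian is an assumption, not stated in the paper."""
    return max(1, round(particle_count / avg_particles_per_person))
```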
3.4.3 Pedestrian Association
After pedestrian counting, our next goal is to track these pedestrians. For this purpose, we use the Hungarian Algorithm (HA) to associate people from one frame to another based on a score [39, 40], maintaining the identity of each pedestrian in the scene. In our work, we define the score as a combination of two matrices: the distance between bounding-box centers, and the IoU of the bounding boxes. The association algorithm consists of the following steps.
• Establish two empty lists: one for tracking (t − 1) and one for detection (t).
• Going through the detection and tracking lists, calculate the center distance and IoU score and store them in a matrix; a cost function is used to prioritize each score.
• Run the Hungarian algorithm. It searches for the minimum tracking value for each detection in the matrix using bipartite-graph logic, yielding a matrix that depicts the matching between detections and trackers.
• In case of complex occlusion where bounding boxes overlap, producing two or more matches for a single candidate, we set the maximum IoU to 1 and all remaining entries to 0. Also, since we have a score rather than a cost, we replace 1 with −1 to allow an easy search for the minimum value.
• Missing values in the Hungarian matrix correspond to unmatched detections and unmatched trackers.
• Unmatched trackers live for 5 further frames awaiting re-association, after which they are removed. For unmatched detections, we initialize a new tracker; if the new tracker persists for 3 frames it becomes active, otherwise it is removed.
The result of pedestrian association is a set of trackers, each associated with a detection, which becomes the input to the tracking stage.
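The association step can be sketched as follows. The paper uses the Hungarian algorithm; for the small per-frame matrices, this dependency-free sketch finds the same optimal matching by exhaustive search over permutations, scoring pairs by IoU only (the center-distance term and cost weighting are omitted for brevity):

```python
from itertools import permutations

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def associate(trackers, detections):
    """Assign detections to trackers by maximising total IoU.
    Assumes at least as many detections as trackers; exhaustive search
    over permutations gives the same optimal matching as the Hungarian
    algorithm for these small per-frame problems."""
    n = len(trackers)
    best, best_score = None, -1.0
    for perm in permutations(range(len(detections)), n):
        score = sum(iou(trackers[i], detections[j]) for i, j in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return list(enumerate(best))   # (tracker_idx, detection_idx) pairs
```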
3.4.4 Pedestrian Tracking
The goal of our pedestrian tracking model is to establish the trajectories of all pedestrians in a scene through association and position matching in each frame of the video using a dynamic template matching approach. The problem is to determine whether, and where, an occurrence (or at least a sufficiently similar occurrence) of the template exists in the target image. For this purpose, we use the correlation coefficient as a measure of similarity between the reference (template) and each location (x, y) in the target image; the result is maximal at locations where the template corresponds pixel by pixel to the sub-image located at (x, y).

Our tracking module obtains position information from the detection module and associates pedestrians using HA. This knowledge is then utilized to extract a reference image, whenever required, from the previous frame, and the module places the same rectangle over the pedestrian if it is found while searching the frames captured from that point onward; otherwise, a new template is generated. Since the templates for searching are generated dynamically, this constitutes a dynamic template matching algorithm. Hence, using data association, predicted and detected pedestrians are associated and tracked with normalized cross-correlation as a cost function and dynamic template matching. In our tracking approach, we represent each pedestrian as a rectangle with its corresponding ID at the bottom right, as depicted in Fig. 12.
Fig. 12 Sample frames of crowd tracking result at different time inter-
vals over UCSD dataset
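The normalized cross-correlation search at the heart of the dynamic template matcher can be sketched as follows. This is a brute-force illustration; production code would use an FFT-based or library implementation:

```python
import numpy as np

def ncc(template, patch):
    """Normalized cross-correlation between a template and an
    equally-sized patch; 1.0 means a pixel-by-pixel match."""
    t = template - template.mean()
    p = patch - patch.mean()
    denom = np.sqrt((t ** 2).sum() * (p ** 2).sum())
    return float((t * p).sum() / denom) if denom else 0.0

def match_template(frame, template):
    """Slide the template over the frame and return the (y, x) of the
    best NCC response together with its score."""
    th, tw = template.shape
    fh, fw = frame.shape
    best, pos = -2.0, (0, 0)
    for y in range(fh - th + 1):
        for x in range(fw - tw + 1):
            s = ncc(template, frame[y:y + th, x:x + tw])
            if s > best:
                best, pos = s, (y, x)
    return pos, best
```

A template pasted into an otherwise blank frame is recovered at its true position with a score of 1.0.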
3.5 Anomaly Detection
Categorization of interactions within crowds into normal and abnormal requires the extraction of robust spatio-temporal descriptors along with a strong classifier. Hence, for accurate anomaly detection, after applying semantic segmentation (Sect. 3.3), the extracted silhouettes are passed through the descriptor extraction step and optimized using AGA. Finally, the decision is made by the multilayer neuro-fuzzy classifier.
3.5.1 Descriptor Extraction
In this paper, we introduce multiple distinguishable hybrid spatio-temporal descriptors, including crowd shape deformation, silhouette slicing, particle convection, energy, and dominant motion descriptors.
In the crowd shape deformation descriptor, we analyze the deformation of the crowd shape over time. We first extract the crowd contour by computing the centers of all pedestrians present in the scene and connecting the centers of those pedestrians at the corners of the frame, thus forming the biggest convex hull representing the crowd contour, as depicted in Fig. 13. After crowd contour extraction, we compare the contours of two consecutive frames using the normalized moments of the contour to compute the changes in crowd shape. Hence, we integrate over all points of the contour and compute the central moment, as expressed in Eq. (7).
$$C_{r,s} = \sum_{x}^{n} \sum_{y}^{n} A(x,y)\,(x - x_{avg})^{r} (y - y_{avg})^{s} \qquad (7)$$
Fig. 13 Crowd shape deformation descriptors. (a) Normal frame, (b) abnormal frame
where $r$ and $s$ represent the powers to which $x$ and $y$ are taken in the sum, and $A(x,y)$ denotes the intensity of the pixel at coordinate $(x,y)$. The summation is over all points of the contour boundary. The moment computed above is like a central moment, except that the values of $x$ and $y$ are average values. For simple $r$ and $s$ moments, if $r$ and $s$ are both 0, then the moment $C_{0,0}$ is just the length of the contour in points. To compute the normalized moment, we divide by an appropriate power of $C_{0,0}$, as expressed in Eq. (8):

$$N_{r,s} = \frac{C_{r,s}}{C_{0,0}^{(r+s)/2+1}} \qquad (8)$$

where $N_{r,s}$ represents the normalized moment. This normalized moment of the contour is used to compare the contours of consecutive frames to analyze variations in crowd shape.
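Eqs. (7) and (8) can be sketched directly. Using the plain means of the contour coordinates for the averages, and unit intensity (so that $C_{0,0}$ equals the contour length in points, as the text states), is our reading of the definitions:

```python
import numpy as np

def normalized_contour_moment(points, r, s, intensity=None):
    """Eqs. (7)-(8): central moment C_{r,s} over contour points, then
    normalisation by C_00^((r+s)/2 + 1). With unit intensity, C_00 is
    the number of contour points."""
    pts = np.asarray(points, dtype=float)
    a = np.ones(len(pts)) if intensity is None else np.asarray(intensity, float)
    xavg, yavg = pts[:, 0].mean(), pts[:, 1].mean()
    C = lambda rr, ss: np.sum(a * (pts[:, 0] - xavg) ** rr
                                * (pts[:, 1] - yavg) ** ss)
    return C(r, s) / C(0, 0) ** ((r + s) / 2 + 1)
```

Because the moments are centered and normalised, comparing $N_{r,s}$ across consecutive frames measures shape change independently of contour position and scale.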
We also introduce a new robust silhouette slicing descriptor in the time domain. First, we create patches (slices) of all pedestrian silhouettes present in the scene by finding their centers and joining each center to the two vertical extreme points by a line segment; patches are then created such that their centers lie on the line segment pertinent to the body of each pedestrian. Thus each pedestrian consists of a set of slices/patches, as shown in Fig. 14a.
The total number of slices per pedestrian is a user choice; however, in our experiments we fixed the count to six because it shows good results. After slicing, we compute the motion changes over time by first labeling each pedestrian and automatically choosing the four best slices. For slice selection, we compute the histogram of each slice after converting RGB to HSI, and the four slices of each pedestrian whose histograms are the most equalized compared to the others are selected automatically. After that, the selected slices are matched with the corresponding candidate slices of the same labeled human in the next frame to compute the motion changes. For slice matching, we take only the I channel of the HSI histogram and use the Hellinger distance between the candidate histogram and the model histogram, as expressed in Eq. (9).
Fig. 14 Silhouette slicing descriptors: a silhouette slicing, b and c selected histograms of random slices, d unselected histogram of a random slice
Fig. 15 Particle convection descriptors: a PCD for all human silhouettes, b magnified view of PCD for two silhouettes
$$HD\!\left(h^{m}_{i}, h^{c}_{i}\right) = \frac{1}{\sqrt{2}}\sqrt{\sum_{i=1}^{k}\left(\sqrt{h^{m}_{i}} - \sqrt{h^{c}_{i}}\right)^{2}} \quad (9)$$
where h^m_i and h^c_i represent the probability distributions of the model and candidate histograms, respectively. The model slice in a frame is matched to the candidate slice in the subsequent frame that presents the smallest Hellinger distance.
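The Hellinger-distance slice matching of Eq. (9) can be sketched as follows (illustrative names; histograms are assumed nonnegative and are normalized inside the function):

```python
# Sketch of Eq. (9): Hellinger distance between I-channel histograms and
# best-match selection of a candidate slice. Illustrative names only.
import numpy as np

def hellinger(h_model, h_cand):
    """Hellinger distance between two histograms (normalized internally)."""
    p = np.asarray(h_model, dtype=float)
    q = np.asarray(h_cand, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2.0)

def match_slice(model_hist, candidate_hists):
    """Index of the candidate slice with the smallest Hellinger distance."""
    return int(np.argmin([hellinger(model_hist, c) for c in candidate_hists]))
```

The distance is 0 for identical histograms and 1 for disjoint ones, which makes the per-slice matching scores directly comparable.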
In the particle convection descriptor, we convert each silhouette present in the scene into particles such that every silhouette is represented by a collection of particles, and estimate the interaction force between particles. In the computation of the interaction force, we consider only those particles that lie on the silhouette contour, as shown in Fig. 15. Generally, in crowded scenes, pedestrians have certain goals and destinations and thus have a desired velocity as well, calculated using bilinear interpolation of the
neighboring flow field vectors using Eq. (10).
$$M^{c}_{j} = (1 - w_{j})\,V(x_{j}, y_{j}) + w_{j}\,V_{mean}(x_{j}, y_{j}) \quad (10)$$
where M^c_j is the desired velocity, V(x_j, y_j) represents the optical flow of particle j, and V_mean(x_j, y_j) is the mean optical flow for particle j at coordinate (x_j, y_j); w_j is the panic weight parameter, where w_j → 0 indicates individual motion of pedestrian j and w_j → 1 depicts group motion of pedestrian j. However, in reality, due to the presence of other pedestrians and obstacles in a crowd, the actual motion of a pedestrian differs from the desired motion and is calculated as in Eq. (11).
$$M_{j} = V_{mean}(x_{j}, y_{j}) \quad (11)$$
where M_j is the actual motion of pedestrian j. The interaction force of a particle with the environment and its neighboring particles can be calculated from the difference between the actual velocity of the particle and its desired velocity. Hence, utilizing the actual motion and desired motion calculated above, we compute the interaction force using Eq. (12).
$$IF = \frac{1}{\tau}\left(M^{c}_{j} - M_{j}\right) - \frac{dM_{j}}{dt} \quad (12)$$
where IF represents the resultant interaction force, τ is the relaxation parameter, and dM_j/dt is the change in the motion of pedestrians over time. Using this interaction force, we observe the pedestrians' motion dynamics.
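A sketch of the desired motion, actual motion, and interaction force of Eqs. (10)-(12) follows; the relaxation parameter tau, the time step dt, and the finite-difference estimate of dM_j/dt are assumptions for illustration:

```python
# Sketch of Eqs. (10)-(12). The relaxation parameter tau, the time step dt,
# and the finite-difference estimate of dM_j/dt are illustrative assumptions.
import numpy as np

def interaction_force(v_part, v_mean, v_mean_prev, w, tau=0.5, dt=1.0):
    """
    v_part:      optical flow of particle j, V(x_j, y_j)
    v_mean:      mean optical flow around particle j, Vmean(x_j, y_j)
    v_mean_prev: mean flow at the previous frame (to approximate dM_j/dt)
    w:           panic weight, 0 = individual motion, 1 = group motion
    """
    v_part, v_mean = np.asarray(v_part, float), np.asarray(v_mean, float)
    m_des = (1.0 - w) * v_part + w * v_mean                # Eq. (10): desired motion
    m_act = v_mean                                          # Eq. (11): actual motion
    dm_dt = (v_mean - np.asarray(v_mean_prev, float)) / dt  # change of motion
    return (m_des - m_act) / tau - dm_dt                    # Eq. (12)
```

A large force magnitude on many contour particles then indicates that actual motion deviates strongly from desired motion, as in panic scenes.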
In the dominant motion descriptors [41,42], we extract trajectories using a set of feature point tracks expressed as
$$\left\{\left(x^{j}_{d}, y^{j}_{d}\right)\right\}, \quad d = D^{j}_{strt}, \ldots, D^{j}_{fnl}, \quad j = 1, \ldots, N \quad (13)$$
where N is the total count of point tracks. We use the Lucas-Kanade tracker for tracking these feature points. Our purpose is to cluster these point tracks into dominant patterns of motion. Some point tracks cover only a small portion of a pedestrian's motion; for dominant motion trajectories, we therefore use point tracks with protracted trajectories through the scene along which a large number of feature point tracks exist. We cluster point tracks that have the same direction of motion and are spatially close to each other. For this purpose, we use a distance metric based on the longest common subsequence to compare point tracks. We sample new points after every eighth frame and add them to the tracker, since new pedestrians enter the videos over time. A new cluster is established if a track is found that is adequately long and distinct from every existing cluster center. We update a cluster's center using a least-squares polynomial fit when its size exceeds a fixed value. Finally, to attain the dominant motion paths, we merge clusters whose centers' similarity exceeds 50%. The steps for the dominant motion descriptors are given in Algorithm 1.
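The clustering of point tracks can be illustrated with a simplified greedy scheme; note that this replaces the paper's longest-common-subsequence distance and least-squares center update with placeholder position/direction thresholds:

```python
# Simplified greedy clustering of point tracks into dominant motion patterns.
# The LCSS-based distance and least-squares center update of the paper are
# replaced here by placeholder position/direction thresholds.
import numpy as np

def track_signature(track):
    """Mean position and unit net direction of a track given as (n, 2) points."""
    t = np.asarray(track, dtype=float)
    d = t[-1] - t[0]
    n = np.linalg.norm(d)
    return t.mean(axis=0), (d / n if n > 0 else d)

def cluster_tracks(tracks, pos_thresh=30.0, dir_thresh=0.8):
    clusters = []  # each: {"pos": mean position, "dir": direction, "n": members}
    for tr in tracks:
        pos, direc = track_signature(tr)
        for c in clusters:
            if (np.linalg.norm(pos - c["pos"]) < pos_thresh
                    and float(np.dot(direc, c["dir"])) > dir_thresh):
                c["pos"] = (c["pos"] * c["n"] + pos) / (c["n"] + 1)  # update center
                c["n"] += 1
                break
        else:
            clusters.append({"pos": pos, "dir": direc, "n": 1})
    return clusters
```

Tracks that are close and roughly parallel fall into the same cluster; the largest clusters then approximate the dominant motion paths.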
In the energy descriptor, we observe the changes in the energy-level distribution of all silhouettes present in the scene. For this, we store the movements of human body parts in the form of energy maps in an energy-based matrix with values between 0 and 8000. After the distribution of energy values, we extract the energy index values into a 1D array. The mathematical representation of the energy distribution is expressed in Eq. (14).
$$E_{i} = \sum_{v=0}^{n} Id_{R}(v) \quad (14)$$
where E_i represents the 1D array, v is the index number, and Id_R(v) gives the RGB values at v. The energy distribution of some random frames of the UMN dataset is shown in Fig. 16. We analyze the variation in the distribution temporally and use a heuristic thresholding technique to identify the scenes where the energy index values exceed the threshold (Fig. 17).
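A minimal sketch of the energy descriptor of Eq. (14) with a heuristic threshold follows (the threshold value is illustrative, not the paper's):

```python
# Sketch of the energy descriptor, Eq. (14): flatten per-frame energy maps
# into a 1-D index array and flag frames whose total energy exceeds a
# heuristic threshold. The threshold value is illustrative, not the paper's.
import numpy as np

def energy_index(energy_map):
    """Flatten an energy map (values in [0, 8000]) into a 1-D array."""
    return np.asarray(energy_map, dtype=float).ravel()

def abnormal_frames(energy_maps, threshold=5.0e5):
    """Indices of frames whose summed energy exceeds the threshold."""
    return [i for i, m in enumerate(energy_maps)
            if energy_index(m).sum() > threshold]
```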
Fig. 16 Energy descriptors: a normal frame, b abnormal frame
Fig. 17 AGA optimization: a normal optimal descriptors, b abnormal optimal descriptors
3.5.2 Event Optimization: Adaptive Genetic Algorithm
After descriptor extraction, the extracted descriptors are optimized using an adaptive genetic algorithm (AGA) [43,44]. Descriptor vectors are mapped to their respective genes to convert them into equivalent chromosomes, such that each gene represents a descriptor in a chromosome. The set of these chromosomes is called the population. At first, we randomly create the population of chromosomes. After that, the chromosomes of this population are evaluated through a fitness function, which indicates how effective a chromosome is for an optimal solution. The chromosomes with maximum fitness values are selected, and genetic operations (crossover and mutation) are applied on the selected chromosomes to form new ones. Crossover is performed only according to above-average or average fitness; the adaptive values are computed as expressed in Eq. (15).
$$O = \begin{cases} w_{1}\,(H_{max} - H)/(H_{max} - H_{avg}), & H > H_{avg} \\ w_{3}, & H \leq H_{avg} \end{cases} \quad (15)$$
where H_max and H_avg are the maximum and average fitness of the chromosomes in the active generation, computed as
$$H_{avg} = \frac{\sum_{j=1}^{h} H_{j}}{h}, \quad H_{max} = \max\left(H_{1}, H_{2}, H_{3}, \ldots, H_{f}\right)$$
However, for mutation, we set the probability low and use Eq. (16) for the mutation process.
$$M = \begin{cases} w_{2}\,(H_{max} - H)/(H_{max} - H_{avg}), & H > H_{avg} \\ w_{4}, & H \leq H_{avg} \end{cases} \quad (16)$$
where w_1, w_2, w_3, and w_4 are weight parameters. If the current fitness value is higher than the previous one, we replace the previous one with the new one. The process finishes only when the stopping criteria are met. The purpose is to evolve these chromosomes by generating better ones until optimal descriptors are obtained. The working mechanism of AGA is depicted in Algorithm 2.
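The AGA loop with the adaptive crossover and mutation probabilities of Eqs. (15) and (16) can be sketched as follows (binary genes, tournament selection, and the weight values are illustrative assumptions, not the paper's settings):

```python
# Sketch of one AGA generation following Eqs. (15)-(16). Binary genes,
# tournament selection, and the weight values w1..w4 are illustrative
# assumptions.
import random

def adaptive_prob(h, h_max, h_avg, w_high, w_low):
    """Eqs. (15)-(16): operator probability, scaled down for fitter chromosomes."""
    if h > h_avg:
        return w_high * (h_max - h) / (h_max - h_avg)
    return w_low

def aga_step(pop, fitness, w1=0.9, w2=0.1, w3=0.9, w4=0.1):
    scores = [fitness(c) for c in pop]
    h_max, h_avg = max(scores), sum(scores) / len(scores)
    new_pop = []
    for chrom, h in zip(pop, scores):
        child = list(chrom)
        if random.random() < adaptive_prob(h, h_max, h_avg, w1, w3):
            mate = max(random.sample(pop, 2), key=fitness)  # tournament selection
            cut = random.randrange(1, len(child))           # one-point crossover
            child = child[:cut] + list(mate[cut:])
        if random.random() < adaptive_prob(h, h_max, h_avg, w2, w4):
            i = random.randrange(len(child))
            child[i] = 1 - child[i]                         # flip a binary gene
        new_pop.append(child)
    return new_pop
```

Because the probabilities shrink as a chromosome approaches the generation's maximum fitness, good descriptor combinations are disrupted less often than poor ones.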
3.5.3 Classifier: Multilayer Neuro-Fuzzy
The multi-fused optimized descriptors from AGA are fed to a multilayer neuro-fuzzy classifier (NFC) for decision making. In the NFC, we fuse the learning ability of neural networks with the knowledge representation of fuzzy logic. This fusion makes the classifier less dependent on expert knowledge and more systematic. Our NFC consists of five layers, namely the input layer, fuzzification layer, fuzzy rules layer, output membership functions layer, and defuzzification layer. The role of each layer is described as follows:
In Layer 1, the input layer, the extracted optimized descriptors are fed to the multilayer NFC, which has two inputs x and y, where x represents the spatial descriptors (A1, A2) and y represents the temporal descriptors (B1, B2, B3).
In Layer 2, the fuzzification layer, the crisp descriptor values are transformed into fuzzy values using membership functions, as depicted in Eq. (17).
$$O^{2}_{i} = \mu_{f_{i}}(x), \quad i = 1, 2, 3 \quad (17)$$
where O^2_i represents the output of Layer 2, and μ_{f_i}(x) can be any fuzzy membership function. In our work, we use a combination of three membership functions, namely R, triangular, and L, to create the intersection points for the respective NFC inputs; the regions under the membership functions are called fuzzy regions. Membership values that fall purely under the R fuzzy region are considered normal, and those that fall purely under the L region are predicted as abnormal.
Fig. 18 Accuracy of multilayer NFC with different numbers of fuzzy rules
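For illustration, the L, triangular, and R membership functions of the fuzzification layer can be sketched as piecewise-linear functions (the breakpoints a, b, c and the exact shapes are assumed, not the paper's values):

```python
# Piecewise-linear L, triangular, and R membership functions for the
# fuzzification layer. Breakpoints a, b, c are assumed values; in this sketch
# L falls from 1 to 0 and R rises from 0 to 1.
def mu_l(x, a, b):
    """L-shaped: 1 up to a, falling linearly to 0 at b."""
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    return (b - x) / (b - a)

def mu_tri(x, a, b, c):
    """Triangular: 0 outside (a, c), peak 1 at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def mu_r(x, a, b):
    """R-shaped: 0 up to a, rising linearly to 1 at b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)
```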
In Layer 3, the fuzzy rule layer, fuzzy rules are defined to evaluate the fuzzy values. A sample fuzzy rule is shown in Eq. (18).
$$\text{Rule } N: \text{IF } O^{2}_{i} \text{ is } t^{N}_{i}, \text{ THEN } Z_{N} = w^{3}_{i} \quad (18)$$
where tN
iis the threshold value and w3
iis the weight assigned
by layer 3. In our experimentations, we used seven if–then
fuzzy rules. The selection of seven rules is because of their
higher accuracy rate. Figure 18 shows the accuracy of our
system with different choices of rules.
Layer 4, the output membership function layer, provides the output membership values, which are then normalized for the next layer.
In Layer 5, the defuzzification layer, defuzzification is done by performing the MAX operation. Figure 19 depicts the architecture of the NFC.
4 Experimental Setup and Results
This section elaborates the details of all experiments performed to validate the proposed model. All processing and experimentation were performed with MATLAB and Google Colab (Python) tools. The hardware used was an Intel Core i5-6200U with 2.40 GHz processing power, 8 GB RAM, and a 2 GB dedicated Nvidia 920M graphics card, running x64-based Windows 10 Pro. We segregate the experiments into three parts. In the first part, we evaluate the performance of crowd counting and tracking using the UCSD and
Fig. 19 Architecture of proposed multilayer neuro-fuzzy classifier for anomaly detection
Mall datasets. In the second part, we evaluate the performance of anomaly detection with the UMN and MED datasets. Lastly, in the third part, we compare our proposed model with other state-of-the-art methods. This section is further split into three subsections: dataset description, performance metrics with results, and discussion.
4.1 Datasets Description
For crowd tracking, we used two datasets: the UCSD and Mall datasets. The UMN crowd and MED datasets are used for anomaly detection. Details of each dataset are given in the following subsections.
4.1.1 UCSD Pedestrian Dataset
The UCSD dataset [45] is a 2000-frame video dataset containing videos of pedestrians, captured by a stationary camera over UCSD walkways. The average crowd count for the UCSD dataset is 24.8 per frame. The videos were recorded at 10 frames per second with a video resolution of 328 × 158.
4.1.2 Mall Pedestrian Dataset
The Mall pedestrian dataset [46] consists of 2000 frames of pedestrians, captured inside a shopping mall using a publicly accessible surveillance camera. The average crowd count for the Mall dataset is 31.2 per frame. The videos were recorded
Table 3 Measurements of MAE and MSE for pedestrian counting over UCSD pedestrian and Mall datasets

Model          | UCSD MAE | UCSD MSE | Mall MAE | Mall MSE
Proposed model | 1.69     | 2.09     | 2.57     | 4.34

Bold values indicate the mean values
Mean Accuracy = 96.69%
Fig. 20 Confusion matrix for semantic segmentation accuracy over UCSD dataset. *car: car, sdk: sidewalk, hmn: human, bcy: bicycle, grass: grass, sky: sky, tree: tree, sktb: skateboard, road: road
Mean Accuracy = 96.03%
Fig. 21 Confusion matrix for semantic segmentation accuracy over UMN dataset. *car: car, sdk: sidewalk, hmn: human, bcy: bicycle, grass: grass, sky: sky, tree: tree, sktb: skateboard, road: road
at less than 2 frames per second (fps) with a video resolution of 640 × 480.
4.1.3 UMN Dataset
The UMN dataset [47] was captured at the University of Minnesota. The dataset contains 11 videos with three different scenes, specifically one indoor and two outdoor scenes. There are six scenarios in the indoor scene, with a total of 4144 frames. The two outdoor scenes are the plaza scene, with three scenarios and 2142 frames, and the lawn scene, with two scenarios and 1453 frames. Each video in the UMN dataset starts with a normal scene and ends with sequences of abnormal crowd behavior.
4.1.4 MED Dataset
The MED dataset [48] consists of videos recorded using an immobile video camera elevated at a height, with a video resolution of 554 × 235. Each video in MED begins with a normal scene and terminates with an abnormal one, with crowd densities changing from sparse to very crowded.
4.2 Performance Metrics and Results
We used seven evaluation metrics to measure the performance of our proposed model. For evaluating the performance of crowd counting, we used two universal quantitative metrics, i.e., mean absolute error (MAE) and mean square error (MSE).
$$MAE = \frac{1}{T}\sum_{v=1}^{T}\left|G_{v} - H_{v}\right| \quad (19)$$
$$MSE = \frac{1}{T}\sum_{v=1}^{T}\left(G_{v} - H_{v}\right)^{2} \quad (20)$$
where T is the total number of testing frames, while G_v and H_v are the predicted and ground-truth counts of pedestrians, respectively. The performance of pedestrian tracking and crowd anomaly detection was evaluated with five evaluation metrics, i.e., accuracy, precision, recall, F1 score, and the confusion matrix [49,50].
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \quad (21)$$
$$Precision = \frac{TP}{TP + FP} \quad (22)$$
$$Recall = \frac{TP}{TP + FN} \quad (23)$$
Table 4 Measurements of accuracy, recall and F1 score for crowd tracking over UCSD pedestrian dataset

Sequence no (frame 70) | Ground truth | TP | FN | FP | Accuracy | Recall | F1 score
4  | 9  | 9  | 0 | 0 | 1     | 1     | 1
9  | 14 | 13 | 1 | 0 | 0.928 | 0.928 | 0.962
16 | 17 | 15 | 2 | 0 | 0.882 | 0.882 | 0.937
24 | 21 | 19 | 2 | 0 | 0.904 | 0.904 | 0.949
30 | 25 | 22 | 3 | 0 | 0.880 | 0.880 | 0.936
Mean accuracy: 91.8%
Table 5 Measurements of accuracy, recall and F1 score for crowd tracking over Mall pedestrian dataset

Sequence no (frame 70) | Ground truth | TP | FN | FP | Accuracy | Recall | F1 score
5  | 11 | 10 | 1 | 0 | 0.909 | 0.909 | 0.952
11 | 15 | 14 | 1 | 0 | 0.933 | 0.933 | 0.965
17 | 21 | 19 | 2 | 0 | 0.904 | 0.904 | 0.949
22 | 26 | 22 | 4 | 0 | 0.846 | 0.846 | 0.916
30 | 30 | 26 | 4 | 0 | 0.866 | 0.866 | 0.928
Mean accuracy: 89.16%
Mean Accuracy of Event Detection = 93.5%
Fig. 22 Confusion matrix with mean accuracy for crowd anomaly detection on UMN dataset
Table 6 Measurements of precision, recall and F1 score for crowd anomaly detection over UMN dataset

Events   | Precision | Recall | F1 score
Normal   | 0.951     | 0.98   | 0.964
Abnormal | 0.979     | 0.95   | 0.963
Average  | 0.965     | 0.965  | 0.963

Bold values indicate the mean values
$$F1\,score = \frac{2 \times Precision \times Recall}{Precision + Recall} \quad (24)$$
where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. The percentage of the total count of accurate classifications is referred to as accuracy, precision refers to the closeness of the measurements to each other, the percentage of real positives classified as anomalous is referred to as recall, and the F1 score is a measure of test accuracy. First, we evaluate the performance accuracy of semantic segmentation for multi-object detection; Figs. 20 and 21 show confusion matrices for segmentation accuracy over the UCSD and UMN datasets.

Mean Accuracy of Event Detection = 92.5%
Fig. 23 Confusion matrix with mean accuracy for crowd anomaly detection on MED dataset

Table 7 Measurements of precision, recall and F1 score for crowd anomaly detection over MED dataset

Events   | Precision | Recall | F1 score
Normal   | 0.923     | 0.96   | 0.941
Abnormal | 0.958     | 0.92   | 0.938
Average  | 0.940     | 0.94   | 0.939

Bold values indicate the mean values
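A minimal sketch of the evaluation metrics of Eqs. (19)-(24) follows (function names are illustrative):

```python
# Sketch of the evaluation metrics in Eqs. (19)-(24); names are illustrative.
def mae(gt, pred):
    """Eq. (19): mean absolute error over per-frame counts."""
    return sum(abs(g - p) for g, p in zip(gt, pred)) / len(gt)

def mse(gt, pred):
    """Eq. (20): mean square error over per-frame counts."""
    return sum((g - p) ** 2 for g, p in zip(gt, pred)) / len(gt)

def prf1(tp, tn, fp, fn):
    """Eqs. (21)-(24): accuracy, precision, recall, and F1 score."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```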
Table 8 Comparative analysis for crowd counting with other state-of-the-art methods in terms of mean absolute error (MAE) and mean square error (MSE) over UCSD and Mall datasets

Methods                       | UCSD MAE | UCSD MSE | Mall MAE | Mall MSE
MSHD [51]                     | 2.10     | 6.05     | 2.90     | 14.04
Geometric head detection [52] | 2.05     | 4.93     | 4.09     | 14.9
ST-CNN [53]                   | –        | –        | 4.03     | 5.87
DigCrowd [54]                 | –        | –        | 3.21     | 16.4
AMS-CNN (FV-3) [55]           | 2.46     | 3.32     | 2.94     | 3.63
FSAC-Meta learning [56]       | 3.08     | 4.16     | 2.44     | 3.12
DIR-CDCC [57]                 | 1.79     | 2.47     | 2.36     | 3.12
Proposed model                | 1.69     | 2.09     | 2.57     | 4.34

Bold values indicate the mean values
Table 9 Accuracy comparison of the proposed approach to state-of-the-art crowd tracking methods over UCSD dataset

Models                   | Average accuracy (%)
DDPMO [58]               | 81.3
MHT [59]                 | 65.3
DCEM [60]                | 77.9
TBC [61]                 | 82.1
PGM based IEC model [62] | 86.9
Proposed model           | 91.8

Bold value indicates the mean values
4.2.1 Experiment 1: Crowd Counting and Tracking Over
UCSD and Mall Datasets
We evaluate the performance of our proposed crowd counting and tracking system on two publicly available benchmark datasets, i.e., the UCSD and Mall datasets. For evaluating the classification accuracy of the NFC, the experiments were repeated three times on the testing sets of each dataset independently. Table 3 shows the MAE and MSE of our proposed pedestrian crowd counting system over the UCSD and Mall datasets (Figs. 20, 21).
Tables 4 and 5 present the mean accuracy along with the recall and F1 score of the crowd tracking system over the UCSD and Mall datasets for 30 sequences, where one sequence consists of 70 frames.
4.2.2 Experiment 2: Crowd Anomaly Detection Over UMN
and MED Datasets
In this experiment, we evaluate the performance of our crowd anomaly detection system using the confusion matrix, precision, recall, and F1 score over the UMN and MED benchmark datasets. Figure 22 depicts the confusion matrix, and Table 6 shows the performance measurements for crowd anomaly detection over the UMN dataset for the first 30 sequences. Figure 23 shows the confusion matrix for anomaly detection, and Table 7 presents the performance measures for anomaly detection over the Motion Estimation Dataset (MED).
4.2.3 Experiment 3: Crowd Counting, Tracking and Anomaly Detection Comparison with State-of-the-art Methods
In this experiment, we compare our proposed system with other well-known methods. Table 8 shows a comparison of our proposed crowd counting system with other state-of-the-art methods.
In Table 9, a comparison between our crowd tracking system and other state-of-the-art systems shows that our system achieved a higher accuracy rate than existing crowd tracking methods.
Table 10 presents a comparison of our proposed crowd anomaly detection system with other state-of-the-art systems on the UMN dataset, with two outdoor scenes (scenes 1 and 3) and one indoor scene (scene 2). As depicted, our system achieved
Table 10 Accuracy comparison of the proposed approach to state-of-the-art crowd anomaly detection methods over UMN dataset

Methods                  | Scene 1 | Scene 2 | Scene 3 | Overall accuracy (%)
Cell Structure [63]      | –       | –       | –       | 88.3
LECM-2 [64]              | 98.5    | 87.9    | 94.6    | 93.01
CLMR (modified-GT) [65]  | 98.7    | 93.7    | 97.1    | 96.5
2STG-AKSVD2 [66]         | 89.80   | 76.42   | 89.50   | 85.20
MU-LSTM [67]             | 97      | 94      | 93      | 94.6
Tracklet Clustering [68] | 97      | 89      | 98      | 94.3
MMAV [69]                | –       | –       | –       | 90.0
PGM based IEC Model [62] | 87.43   | 83.21   | 90.63   | 86.06
Proposed model           | 99.23   | 95.56   | 97.89   | 96.59

Bold value indicates the mean values
a higher accuracy rate as compared to the other well-known
existing methods.
5 Conclusion
In this article, we introduced the idea of semantic segmentation for foreground extraction. We used clustering techniques and dynamic template matching for crowd counting and tracking. For anomaly detection, new robust spatio-temporal descriptors were extracted, optimized using AGA, and passed to a multilayer NFC. Through detailed experimentation, we demonstrated the ability of our proposed system in crowded environments. The accuracy of our tracking system drops marginally for dense crowds, mainly because of the full occlusions that occur in the test videos. We evaluated the performance of our proposed model on four publicly available benchmark datasets and achieved superior accuracy rates compared to other existing state-of-the-art systems. The proposed system can be deployed to great benefit in various public places, such as political rallies, public celebrations, airports, train stations, and shopping malls, to control, protect, and supervise crowds.
In the future, we aim to work on more complex crowd environments and address the occlusion problem by introducing new occlusion reasoning methods. Furthermore, we plan to extend our work to the recognition of different scenes, such as riots or chaotic acts, fights, sports, robbery, and road accidents.
Author Contributions Conceptualization, F.A., and A.J.; methodology,
F.A., and A.J.; software, F.A.; Validation, F.A., and A.J.; formal anal-
ysis, F.A., and A.J.; resources, A.J.; writing-review and editing, F.A.,
and A.J.: Authors have read and agreed to the published version of the
manuscript.
Data Availability Data sharing is not applicable.
Declarations
Conflict of interest The authors declare no conflict of interest.
References
1. Mo, H.; Wu, W.: Background noise filtering and distribution divid-
ing for crowd counting. IEEE Trans. Image Process. 29, 8199–8212
(2020)
2. Rezaei, K.; Mohini, M.K.: A survey on deep learning-based
real-time crowd anomaly detection for secure distributed video
surveillance. In: Personal and Ubiquitous Computing, pp. 1–17.
Springer (2021)
3. Balasundaram, A.; Chellappan, C.: An intelligent video analytics model for abnormal event detection in online surveillance video. J. Real-Time Image Process. 17, 915–930 (2020)
4. Jalal, A.; Batool, M.: Sustainable wearable system: human behav-
ior modeling for life-logging activities using K-ary tree hashing
classifier. Sustainability 12, 10324 (2020)
5. Wang, Q.; Yuan, Y.: Pixel-wise crowd understanding via synthetic
data. Int. J. Comput. Vis. 129, 225–245 (2021)
6. Nida, K.; Yazeed, G.; Jalal, A.: Semantic recognition of human-
object interactions via Gaussian-based elliptical modeling and
pixel-level labeling. IEEE Access 6, 66 (2021)
7. Tripathi, G.; Vishwakarma, D.K.: Convolutional neural networks
for crowd behavior analysis: a survey. Vis. Comput. 35, 753–776
(2019)
8. Gochoo, M.; Jalal, A.: Monitoring real-time personal locomotion
behaviors over smart indoor-outdoor environments via body-worn
sensors. IEEE Access 6, 66 (2021)
9. Lentzas, A.; Vrakas, D.: Non-intrusive human activity recognition and abnormal behavior detection on elderly people. Artif. Intell. Rev. 53, 1975–2021 (2020)
10. Nadeem, A.; Jalal, A.: Human actions tracking and recognition
based on body parts detection via artificial neural network. In: Pro-
ceedings of the ICACS, pp. 1–6. IEEE (2020)
11. Akhter, I.; Jalal, A.; Kim, K.: Adaptive pose estimation for gait
event detection using context-aware model and hierarchical opti-
mization. Proc. EE&T 16, 2721–2729 (2021)
12. Grant, J.M.; Flynn, P.J.: Crowd scene understanding from video: a
survey. ACM Trans. Multimedia Comput. Commun. Appl. TOMM
13(2), 1–23 (2017)
13. Gochoo, M.; Jalal, A.: Stochastic remote sensing event classifi-
cation over adaptive posture estimation via deep belief network.
Remote Sens. 13, 912 (2021)
14. Al-Shaery, A.M.; Khozium, M.O.: In-depth survey to detect, mon-
itor and manage crowd. IEEE Access 8, 209008–209019 (2020)
15. Mahmood, M.; Jalal, A: Robust spatio-temporal features for human
interaction recognition via artificial neural network. In: Proceed-
ings of the FIT, pp. 218–223. IEEE (2018)
16. Nadeem, A.; Jalal, A.: Automatic human posture estimation for
sports activity recognition with robust body parts detection. Mul-
timedia Tools Appl. 80, 21465–21498 (2021)
17. Xu, T.; Wang, W.: Crowd counting using accumulated HOG. In:
Proceedings of the ICNC-FSKD, pp. 1877–1881. IEEE (2016)
18. Jalal, A.; Mahmood, M.: Multi-features descriptors for human
activity tracking and recognition in Indoor-outdoor environments.
In: Proceedings of the IBCAST, pp. 371–376. IEEE (2019)
19. Wan, J.; Chan, A.: Adaptive density map generation for crowd
counting. In: Proceedings of the CVF, pp. 1130–1139. IEEE (2019)
20. Pervaiz, M.; Jalal, A.: Hybrid algorithm for multi people counting
and tracking for smart surveillance. In: Proceedings of the IBCAST,
pp. 530–535 (2021)
21. Zhang, X.; Zhang, L.: Real-time crowd counting with human detec-
tion and human tracking. In: Proceedings of the on NIP, pp. 1–8.
Springer. Cham (2014)
22. Khan, S.D.; Cheikh, F.A.: Disam: Density independent and scale
aware model for crowd counting and localization. In: Proceedings
of the ICIP, pp. 4474–4478. IEEE (2019)
23. Gochoo, M.; Jalal, A.: A systematic deep learning-based overhead
tracking and counting system using RGB-D remote cameras. Appl.
Sci. 11, 5503 (2021)
24. Pervaiz, M.; Jalal, A.: Smart surveillance system for people
counting and tracking using particle flow and modified SOM. Sus-
tainability 13, 5367 (2021)
25. Merad, D.; Drap, P.: Tracking multiple persons under partial and
global occlusions: application to customers’ behavior analysis. Pat-
tern Recogn. Lett. 81, 11–20 (2016)
26. Chahyati, D.; Arymurthy, A.M.: Multiple human tracking using
Retinanet features, Siamese neural network, and Hungarian algo-
rithm. In: Proceedings of the IAEME, vol. 10, pp. 465-475 (2020)
27. Pradeepa, B.; Vaidehi, V.: Anomaly detection in crowd scenes using
streak flow analysis. In: Proceedings of the WiSPNET, pp. 363–368
(2019)
28. Kim, K.; Jalal, A.: Vision-based human activity recognition system
using depth silhouettes. J. Electr. Eng. Technol. 14, 2567–2573
(2019)
29. Zhang, X.; Stevens, B.: Scene perception guided crowd anomaly
detection. Neurocomputing 414, 291–302 (2020)
30. Khan, M.U.K.; Kyung, C.M.: Rejecting motion outliers for efficient
crowd anomaly detection. IEEE Trans. IFS 14, 541–556 (2018)
31. Shehzad, A.; Jalal, A.; Kim, K.: Multi-person tracking in smart
surveillance system for crowd counting and normal/abnormal
events detection. In: Proceedings of the ICAEM, pp. 163–168.
IEEE (2019)
32. Yimin, D.O.U.; Wei, C.: Abnormal behavior detection based on
optical flow trajectory of human joint points. In: Proceedings of
the CCDC, pp. 653–658. IEEE (2019)
33. Nawaratne, R.; Yu, X.: Spatiotemporal anomaly detection using
deep learning for real-time video surveillance. IEEE Trans. Ind.
Inf. 16, 393–402 (2019)
34. Chen, T.; Chen, H.: Anomaly detection in crowded scenes using
motion energy model. Multimedia Tools Appl. 77, 14137–14152
(2018)
35. Khalid, N.; Jalal, A.; Kim, K.: Modeling two-person segmentation
and locomotion for stereoscopic action identification. Sustainabil-
ity 13, 970 (2021)
36. Minaee, S.; Terzopoulos, D.: Image segmentation using deep learn-
ing: a survey. In: Proceedings of the TPAMI. IEEE (2021)
37. Jalal, A.; Ahmed, A.; Kim, K.: Scene semantic recognition based
on modified fuzzy C-mean and maximum entropy using object-to-
object relations. IEEE Access 9, 27758–27772 (2021)
38. Rafique, A.A.; Jalal, A.: Statistical multi-objects segmentation for
indoor/outdoor scene detection and classification via depth images.
In: Proceedings of the IEEE, pp. 271–276. IBCAST (2020)
39. Sahbani, B.; Adiprawita, W.: Kalman filter and iterative-hungarian
algorithm implementation for low complexity point tracking as
part of fast multiple object tracking system. In: Proceedings of the
ICSET, pp. 109–115. IEEE (2016)
40. Jalal, A.; Sarif, N.; Kim, T.S.: Human activity recognition via
recognized body parts of human depth silhouettes. Indoor Built
Environ. 22, 271–279 (2013)
41. Alzahrani, A.J.; Ullah, H.: Anomaly detection in crowds by fusion
of novel feature descriptors. J. Eng. Manag. Technol. 11, 11A16B-1
(2020)
42. Jalal, A.; Kim, Y.; Kim, D.: Ridge body parts features for human
pose estimation and recognition from RGB-D video data. In: Pro-
ceedings of the ICCCNT, pp. 1–6. IEEE (2014)
43. Zhu, H.D.; Zhong, Y.: Feature selection method by applying par-
allel collaborative evolutionary genetic algorithm. J. Electr. Sci.
Technol. 8, 108–113 (2010)
44. Jalal, A.; Lee, S.; Kim, T.S.: Human activity recognition via the
features of labeled depth body parts. In: Proceedings of the Smart
Homes and Health Telematics, pp. 246–249. Springer (2012)
45. Chan, A.B.; Vasconcelos, N.: Modeling, clustering, and segment-
ing video with mixtures of dynamic textures. In: Proceedings of
the IEEE TPAMI, vol. 30, pp. 909–926 (2008)
46. Chen, K.; Xiang, T.: Feature mining for localised crowd counting.
In Bmvc. 1, 3 (2012)
47. Mehran, R.; Shah, M.: Abnormal crowd behavior detection using
social force model. In: Proceedings of the CVPR, pp. 935–942
(2009)
48. Rabiee, H.; Murino, V.: Novel dataset for fine-grained abnormal
behavior understanding in-crowd. In: Proceedings of the AVSS,
pp. 95–101. IEEE (2016)
49. Chriki, A.; Kamoun, F.: Deep learning and handcrafted features
for one-class anomaly detection in UAV video. Multimedia Tools
Appl. 80, 2599–2620 (2021)
50. Abdulhussain, S.H.; Mahmmod, B.M.; Saripan, M.I.; Al-Haddad,
S.A.R.; Baker, Flayyih, W.N.; Jassim, W.A.: A fast feature extrac-
tion algorithm for image and video processing. In: Proceedings of
the IJCNN, pp. 1–8 (2019)
51. Ma, T.; Li, N.: Scene invariant crowd counting using multi-
scales head detection in video surveillance. IET Image Proc. 12,
2258–2263 (2018)
52. Miao, Y.; Zhang, B.: ST-CNN: spatial–temporal convolutional neu-
ral network for crowd counting in videos. Pattern Recogn. Lett. 125,
113–118 (2019)
53. Xu, M.; Xu, C.: Depth information guided crowd counting for com-
plex crowd scenes. Pattern Recogn. Lett. 125, 563–569 (2019)
54. Saqib, M.; Blumenstein, M.: Crowd counting in low-resolution
crowded scenes using region-based deep convolutional neural net-
works. IEEE Access 7, 35317–35329 (2019)
55. Pandey, A.; Trivedi, A.: KUMBH MELA: a case study for
dense crowd counting and modeling. Multimedia Tools Appl. 79,
17837–17858 (2020)
56. Reddy, M.K.K.; Wang, Y.: Few-shot scene adaptive crowd counting
using meta-learning. In: Proceedings of the CVF, pp. 2814–2823.
IEEE (2020)
57. He, Y.; Gong, Y.: Error-aware density isomorphism reconstruction
for unsupervised cross-domain crowd counting. In: Proceedings of
the AAAI (2021)
58. Neiswanger, W.; Xing, E.: The dependent Dirichlet process mix-
ture of objects for detection-free tracking and object modeling. In:
Proceedings of the Artificial Intelligence Statistics, pp. 660–668
(2014)
59. Kim, C.; Rehg, J.M.: Multiple hypothesis tracking revisited. In:
Proceedings of the CV, pp. 4696–4704. IEEE (2015)
60. Milan, A.; Roth, S.: Multi-target tracking by discrete-continuous
energy minimization. IEEE TPAMI 38, 2054–2068 (2016)
61. Ren, W.; Chan, A.B.: Tracking-by-counting: using network flows
on crowd density maps for tracking multiple targets. IEEE Trans.
Image Proc. 30, 1439–1452 (2020)
62. Abdullah, F.; Gochoo, M.; Jalal, A.: Multi-person tracking and
crowd behavior detection via particles gradient motion descriptor
and improved entropy classifier. Entropy 23, 628 (2021)
63. Leyva, R.; Li, C.T.: Video anomaly detection with compact fea-
ture sets for online performance. IEEE Trans. Image Process. 26,
3463–3478 (2017)
64. Sezer, E.S.; Can, A.B.: Anomaly detection in crowded scenes using
log-Euclidean covariance matrix. In: Proceedings of the VISI-
GRAPP, pp. 279–286 (2018)
65. Patil, N.; Biswas, P.K.: Global abnormal events detection in
crowded scenes using context location and motion-rich spatio-
temporal volumes. IET Image Proc. 12, 596–604 (2018)
66. Ege, C.Ö.: Two-Stage Sparse Representation Based Abnormal
Crowd Event Detection in Videos. University of Helsinki (2020)
67. Moustafa, A.N.; Gomaa, W.: Gate and common pathway detec-
tion in crowd scenes and anomaly detection using motion
units and LSTM predictive models. Multimedia Tools Appl. 79,
20689–20728 (2020)
68. Hassanein, A.S.; Yagi, Y.: Identifying motion pathways in highly
crowded scenes: a non-parametric tracklet clustering approach.
Comput. Vis. Image Underst. 191, 102710 (2020)
69. Rehman, A.U.; Mahmood, T.; Khan, H.O.A.: Multi-modal
anomaly detection by using audio and visual cues. IEEE Access 9,
30587–30603 (2021)
Springer Nature or its licensor holds exclusive rights to this article
under a publishing agreement with the author(s) or other rightsholder(s);
author self-archiving of the accepted manuscript version of this article
is solely governed by the terms of such publishing agreement and appli-
cable law.