Arabian Journal for Science and Engineering
https://doi.org/10.1007/s13369-022-07092-x
RESEARCH ARTICLE-COMPUTER ENGINEERING AND COMPUTER SCIENCE
Semantic Segmentation Based Crowd Tracking and Anomaly Detection via Neuro-fuzzy Classifier in Smart Surveillance System
Faisal Abdullah¹ · Ahmad Jalal¹

✉ Ahmad Jalal, ahmjal@yahoo.com
Faisal Abdullah, 191633@students.au.edu.pk
¹ Department of Computer Science, Air University, E-9, Islamabad 44000, Pakistan
Received: 25 October 2021 / Accepted: 22 June 2022
© King Fahd University of Petroleum & Minerals 2022
Abstract
Crowd tracking and the analysis of crowd behavior is a challenging research area in computer vision. In today's crowded environments, manual surveillance systems are inefficient, labor-intensive, and unwieldy. Automated video surveillance systems offer promising solutions to these problems and have hence become a necessity. However, challenges remain. The most important challenge is the extraction of a foreground representing human pixels only; the extraction of robust spatial and temporal descriptors, together with a potent classifier, is also essential for accurate behavior detection. In this paper, we present our approach to these challenges by introducing semantic segmentation for foreground extraction. Furthermore, for pedestrian counting and tracking, we introduce a fusion of human motion analysis and an attraction force model via a weighted averaging method that removes non-humans and non-pedestrians from the scene. The verified pedestrians are counted using a fuzzy-c-means algorithm and tracked via Hungarian algorithm association along with a dynamic template matching technique. For anomaly detection, after silhouette extraction we introduce robust spatio-temporal descriptors, including crowd shape deformation, silhouette slicing, particle convection, dominant motion, and energy descriptors, which we optimize using an adaptive genetic algorithm; finally, the multi-fused optimal features are fed to a multilayer neuro-fuzzy classifier for decision making. The proposed system is validated via extensive experimentation and achieves accuracies of 91.8% and 89.16% over the UCSD and Mall datasets for crowd tracking. The mean absolute error and mean square error for pedestrian counting are 1.69 and 2.09 over the UCSD dataset and 2.57 and 4.34 for the Mall dataset, respectively. Accuracies of 96.5% and 94% are achieved over the UMN and MED datasets for anomaly detection.
Keywords Attraction force model · Crowd shape deformation · Multilayer neuro-fuzzy classifier · Semantic segmentation · Time-domain descriptors · Tracking and anomaly detection
1 Introduction
Automatic video surveillance is regarded as the first step in various artificial intelligence applications [1, 2] developed to track human crowds and to analyze crowd behavior [3, 4]. Automated surveillance systems rapidly detect unusual and critical situations in crowded environments and thereby assist in making adequate decisions for safety and emergency control [5, 6]. Hence, surveillance systems are essential in complex and crowded environments
like busy streets, political rallies, airports, train stations, and shopping malls, to automatically detect and control the escape or panic behavior caused by riots or chaotic acts, stampedes, pushing, and violent events, for public safety, security, and statistical purposes [7, 8].
To supervise, protect, and control crowds, density estimation and tracking is a crucial video-frame analysis process, as it provides basic descriptions of crowd status [9, 10]. However, counting and tracking in crowded scenes is a challenging problem because of instantaneous illumination changes, different outlooks and behaviors, partial or full occlusions, complicated backgrounds, indoor and outdoor scenes, and the fact that as the crowd grows, the pixels per human decrease [11, 12]. On the other hand, challenges faced in crowd behavior detection involve low resolution with dynamic backgrounds, modeling the crowd behavior, occlusion between individuals, and random variations of a crowd
[13, 14]. Hence, accurate crowd behavior detection across diverse video scenes requires the extraction of robust descriptors that provide significant information about motion and scene changes, together with a strong decision-making classifier [15, 16].
In this research article, we propose a new robust approach for pedestrian crowd density estimation, tracking, and anomaly detection. We begin with pre-processing steps. For foreground extraction, we introduce semantic segmentation by labeling and clustering the pixels belonging to the same class. After foreground extraction, our work involves two facets: (i) pedestrian crowd counting and tracking, and (ii) crowd anomaly detection. For pedestrian counting and tracking, we first verify the extracted silhouettes by introducing a weighted averaging fusion of HMA and AFM, which excludes non-humans and non-pedestrians from the scene. We then use a fuzzy-c-means algorithm to count the pedestrians, the Hungarian algorithm for association, and dynamic template matching for tracking. For anomaly detection, we first extract new robust spatio-temporal descriptors, including crowd shape deformation, dominant motion, silhouette slicing, energy, and particle convection descriptors, which we optimize using an adaptive genetic algorithm. Lastly, the optimized multi-fused distinguishable descriptors are passed through a multilayer neuro-fuzzy classifier for anomaly detection.
The major contributions and highlights presented in this
paper are summarized as follows.
1. We propose a robust semantic segmentation approach for foreground extraction, a necessary step for crowd estimation, tracking, and behavior analysis in crowded scenes.
2. A fusion of AFM and HMA via a weighted averaging process is introduced for the removal of non-humans and non-pedestrians from the scene.
3. A clustering method using a fuzzy-c-means algorithm is used for density estimation; a particle-based measure is also introduced for inferring the number of pedestrians in each cluster.
4. Multi-scale descriptors are introduced, namely crowd shape deformation, dominant motion, energy-based, silhouette slicing, and particle convection descriptors, for anomaly detection.
5. A multilayer neuro-fuzzy classifier is used to make the decision based on multi-fused optimized descriptors for anomaly detection. A comparative analysis is carried out on four publicly available benchmark datasets: UCSD and Mall for crowd counting and tracking, and UMN and MED for anomaly detection.
The rest of the article is arranged as follows: Sect. 2
succinctly reviews other state-of-the-art methods. Section 3
describes the methodology of our proposed system. Per-
formance evaluation of our proposed approach on four
benchmark datasets plus a comparison analysis and discus-
sion is given in Sect. 4. Finally, in Sect. 5, we conclude the
paper and outline the future directions.
2 Related Work
In recent years, different computer vision approaches have
been proposed by researchers for crowd density estimation,
tracking, and anomaly detection [17,18]. We divide the
related work into two subsections, the first section describes
crowd density estimation and tracking systems; however, the
second subsection describes crowd anomaly detection sys-
tems.
2.1 Crowd Density Estimation and Tracking
Various researchers have employed different models to track
and estimate crowd density [19, 20]. Table 1 presents a summary of research works relevant to these models.
2.2 Crowd Anomaly Detection Systems
Numerous researchers have devoted their energies in devel-
oping systems for anomaly detection using different methods
[27, 28]. Table 2 shows a detailed summary of research works relevant to these models.
3 Proposed System Framework
In this paper, we introduce a robust semantic segmentation based pedestrian tracking and anomaly detection system. In our proposed system, we initially apply pre-processing steps, then deploy semantic segmentation for multiple object detection and extract human-resembling silhouettes. After that, we segregate our work into two facets. In the first, for crowd counting and tracking, we verify the extracted silhouettes by introducing a weighted average fusion of the attraction force model and human motion analysis. Next, crowd counting is performed using a fuzzy-c-means algorithm, and for crowd tracking we use Hungarian algorithm association and a dynamic template matching technique. In the second facet, for anomaly detection, after obtaining human silhouettes we extract spatio-temporal descriptors that are optimized using an adaptive genetic algorithm. These optimized features are then fed to a multilayer neuro-fuzzy classifier for anomaly detection. Figure 1 depicts the synoptic schematics of our proposed system. The details of each of the aforementioned modules are explained in the following subsections.
Table 1 Crowd density estimation and tracking systems

Zhang et al. [21]
Methodology: For human detection, a fusion of a cascade boosted classifier and rectangle features was used to train a multi-scale head-shoulder detector; human tracking was then used to eliminate duplicates and count the pedestrians.
Highlights and limitations: The system has certain misclassifications due to the similar randomization of different classes.

Khan et al. [22]
Methodology: A human head detection and localization method was used for crowd counting. Scale-map-based, scale-aware head proposals were generated and passed to a CNN for head probabilities, which are then confirmed and added to the count using non-maximal suppression.
Highlights and limitations: Non-maximal suppression with a fixed threshold was used to get the precise location of heads; localization performance changes as the threshold value changes, which limits the system.

Gochoo et al. [23]
Methodology: Developed Hough circular gradient transforms for head detection and a HOG-based symmetry technique for shoulder detection; detected heads are verified by a 1D CNN and then counted using a cross-line judgment technique.
Highlights and limitations: Accuracy is limited under illumination changes and in complex crowded scenes; especially in queue conditions, the system produced some misdetections of heads due to overlaps.

Pervaiz et al. [24]
Methodology: A template matching method was used for human verification. People counting was then performed by distributing multiple particles on humans to extract particle flows, which are clustered by a self-organizing map.
Highlights and limitations: Accuracy decreases in dense crowds, as the model cannot detect human silhouettes that are partially or fully occluded by other objects for an extended period of time.

Merad et al. [25]
Methodology: Based on the association of two modules: a tracking module, and an association module that recovers the global trajectories of tracked individuals in a multiple-target tracking system.
Highlights and limitations: Not effective for arbitrary movements and overlaps; the k-nearest-neighbor re-identification strategy's accuracy varies with the number k of chosen neighbors.

Chahyati et al. [26]
Methodology: Multiple target objects are detected using RetinaNet, and the Hungarian algorithm is then used for tracking.
Highlights and limitations: Detection accuracy degrades in complex crowds, which limits the system.
3.1 Pre-Processing
During pre-processing, videos from a static video camera are first converted into color frames [f_1, f_2, f_3, ..., f_N], where N is the total number of frames. Each colored frame is then passed through an Adaptive Median Filter (AMF) to effectively remove noise and distortion and to provide smoothing while preserving edges. AMF works in two stages: it compares each pixel in the image to its neighboring pixels and classifies pixels as noise by performing spatial processing. In AMF, pixels are labeled as impulse noise when they are not structurally aligned with the pixels they resemble and differ from a majority of their neighbors. The threshold for the comparison as well as the size of the neighborhood is adjustable. Noisy pixels are replaced by the median value of the neighborhood pixels that have passed the noise-labeling test. After AMF, histogram equalization is performed on the filtered image to adjust its contrast using Eq. (1):
$$s_k = T(r_k) = (L-1)\sum_{j=0}^{k} p_r(r_j) \qquad (1)$$
where k = 0, 1, 2, ..., (L-1) and the variable r denotes the intensities of the input image to be processed. As usual, we assume that r is in the range [0, L-1], with r = 0 representing black and r = L-1 representing white, while s represents the output intensity level after intensity mapping for every pixel of intensity r in the input image. Here, p_r(r) is the probability density function (PDF) of r, the subscript on p indicating that it is the PDF of r. Thus, a processed (output) image is obtained via Eq. (1) by mapping each pixel with intensity r_k in the input image into a corresponding pixel with level s_k in the output image, as shown in Fig. 2.
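For illustration, the following is a minimal Python sketch of this pre-processing stage: a simple adaptive median filter followed by histogram equalization per Eq. (1). The maximum window size and the input file name are assumptions for illustration, not values from our experiments.

```python
import cv2
import numpy as np

def adaptive_median(gray, max_win=7):
    """Replace impulse-noise pixels with the local median, growing the
    window until the median is itself not an extreme value."""
    out = gray.copy()
    pad = max_win // 2
    padded = np.pad(gray, pad, mode="edge")
    h, w = gray.shape
    for y in range(h):
        for x in range(w):
            for win in range(3, max_win + 1, 2):
                r = win // 2
                patch = padded[y + pad - r:y + pad + r + 1,
                               x + pad - r:x + pad + r + 1]
                zmin, zmed, zmax = patch.min(), np.median(patch), patch.max()
                if zmin < zmed < zmax:               # median is not impulse noise
                    if not (zmin < gray[y, x] < zmax):
                        out[y, x] = int(zmed)        # replace noisy pixel
                    break                            # keep original otherwise
    return out

frame = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)  # hypothetical path
filtered = adaptive_median(frame)
enhanced = cv2.equalizeHist(filtered)  # s_k = (L-1) * cumulative PDF, Eq. (1)
```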
3.2 Semantic Segmentation
After the pre-processing phase, we deploy semantic segmentation [35-38] for multi-object detection. An image is a collection of pixels; in semantic segmentation (SS) we classify every pixel of an image into a particular label class, resulting in an image segmented by class. SS is used to recognize collections of pixels that form distinct categories. We apply a deep learning algorithm for semantic segmentation using an encoder-decoder structure. Our encoder-decoder network consists of an encoder module that gradually reduces the feature maps and captures higher semantic information, and a decoder module that refines the segmentation results along object boundaries. Figure 3 shows our encoder-decoder structure with atrous convolution for
Table 2 Crowd anomaly detection systems

Zhang et al. [29]
Methodology: Anomalies are detected via a fluid force model and scene perception. Fluid features and appearance features are extracted, and the final decision is taken by a one-class SVM on the basis of the extracted features.
Highlights and limitations: The model was not effective across different scenes; the adaptability of the method to different cases needs further improvement.

Karim et al. [30]
Methodology: Anomaly is detected on the basis of outlier rejection. Superpixels whose direction of motion does not conform with the dominant motion are rejected, and the extracted features, with a modified k-means classification algorithm, are used for anomaly detection.
Highlights and limitations: The hand-crafted features produced false-positive results in complex crowds; fusing them with a deep neural network is expected to further improve performance.

Shehzad et al. [31]
Methodology: Used the Jaccard similarity index and a template matching technique for multi-people tracking; Gaussian clusters were introduced for abnormal event detection.
Highlights and limitations: Humans are detected via a thresholding technique, which degrades accuracy under occlusions and illumination variations.

Yimin et al. [32]
Methodology: Detects anomalies based on the optical-flow trajectory of the joint points of every human body; features are extracted using the trajectory constraints, and the final decision is taken by an SVM.
Highlights and limitations: Accuracy is seriously reduced in large-scale crowds with inevitable overlaps and occlusions, as it is difficult to obtain an accurate trajectory in a complex crowd.

Nawaratne et al. [33]
Methodology: Developed an incremental spatio-temporal learner model utilizing active learning, which temporally updates on anomalies using convolution layers that learn spatial regularities and ConvLSTM layers that learn temporal regularities.
Highlights and limitations: May produce some false-negative detections for re-occurring anomalies; also requires a large dataset and considerable training time.

Chen et al. [34]
Methodology: A motion energy model was introduced that detects anomalies by considering the sum-of-squared-differences metric of motion information in the center block and its neighboring blocks against a preset threshold.
Highlights and limitations: The fixed block size for calculating the motion energy value limits accuracy; detecting an anomaly whenever the motion energy exceeds a preset threshold is not effective in all cases.
semantic segmentation. We use the atrous spatial pyramid pooling module as an encoder that applies atrous convolution with different rates to probe convolutional features at multiple scales. Atrous convolution allows us to extract features computed by a DCNN at an arbitrary resolution and to adjust the filter field-of-view to capture multi-scale contextual information. At the decoder end, we first upsample the features bilinearly by a factor of 4 and then concatenate the low-level features. Furthermore, for smooth training and to increase the importance of the encoder features, we reduce the channels by applying a 1 × 1 convolution on the low-level features. Finally, we again upsample bilinearly by a factor of 4 after refining the features with a 3 × 3 convolution. Figure 4 shows semantic segmentation results for different random views.
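As an illustration of this step, the sketch below uses torchvision's pretrained DeepLabv3 (an atrous-convolution, ASPP-based encoder-decoder) as a stand-in; the exact architecture and training configuration of our network are not reproduced here.

```python
import torch
import torchvision
from torchvision import transforms

model = torchvision.models.segmentation.deeplabv3_resnet50(pretrained=True).eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def segment(frame_rgb):
    """Return a per-pixel class-label map (H x W) for one RGB frame."""
    inp = preprocess(frame_rgb).unsqueeze(0)        # 1 x 3 x H x W
    with torch.no_grad():
        logits = model(inp)["out"]                  # 1 x C x H x W
    return logits.argmax(dim=1).squeeze(0).numpy()  # class index per pixel
```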
3.3 Silhouettes Extraction
After multi-object detection through semantic segmentation, we extract only those pixels that belong to the human class, as every pixel is labeled and assigned to its particular class in SS. All pixels other than the human class are set to zero, since we are interested in human silhouettes only. After extracting the human-class pixels, we convert the image into a binary image using Eq. (2):
$$bw(x,y) = \begin{cases} 1 & \text{if } I(x,y) > 0 \\ 0 & \text{if } I(x,y) = 0 \end{cases} \qquad (2)$$
where I is the image with only human-class pixels and bw is the resulting binary image containing only human silhouettes, as shown in Fig. 5b.
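A minimal sketch of Eq. (2) follows; the person class index (15, the Pascal-VOC convention used by the torchvision model sketched above) is an assumption that depends on the label map in use.

```python
import numpy as np

PERSON_ID = 15  # assumption: Pascal-VOC "person" class index

def human_silhouettes(label_map):
    human = (label_map == PERSON_ID)              # zero out non-human classes
    return np.where(human, 1, 0).astype(np.uint8) # bw(x, y) per Eq. (2)
```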
3.4 Crowd Density Estimation and Tracking
An authentic pedestrian crowd tracking system requires extraction of the true foreground, representing human pedestrians only. Hence, for scrupulous pedestrian tracking, after semantic segmentation we perform the human silhouette verification step, and then the pedestrian counting and tracking steps are executed.
3.4.1 Pedestrian Human Silhouettes Verification
For human silhouette verification, we introduce a robust Human Motion Analysis (HMA) and Attraction Force Model (AFM).
Fig. 1 Synoptic schematics of the proposed pedestrian tracking and crowd anomaly detection system
Fig. 2 Pre-processing steps: (a) filtered image using AMF, (b) histogram of the filtered image, (c) histogram of the enhanced image, and (d) enhanced image
We eliminate all objects other than human pedestrians using a fusion of HMA and AFM via the weighted averaging method for accurate and strict pedestrian tracking.
Fig. 3 Encoder-decoder structure with atrous convolution for semantic segmentation
Fig. 4 Results of semantic segmentation: (a, b) over UCSD pedestrian dataset, and (c) over UMN dataset
Fig. 5 Foreground extraction: (a) human-class pixels only, (b) binary view of the extracted human class
Fig. 6 Human motion analysis, determining angles from a skeleton, where x_c and y_c are the spatial locations of the center of the skeleton, corresponding to the hip position in the vertical and horizontal directions, respectively

In HMA, to distinguish pedestrians from other objects, e.g., bicyclists and bikes, we determine the internal motion of a moving silhouette over time using the star-skeleton strategy. To produce the star skeleton on the extracted silhouettes, we first find the centroid and then connect the center of each silhouette by a line to three extremal points recovered by traversing the boundaries. The three extremal points usually represent the torso and two legs, taking the uppermost and lowermost extremal points, as human motion is normally in an upright position. Pedestrians exhibit periodic motion when moving, whereas other objects, e.g., bicyclists, exhibit rotational motion. Hence, to analyze the motion, we measure the angle α between the vertical and the uppermost extremal point, the angle β between the two lowermost extremal points, and the angle γ between the end locations of the two extremal points, measuring the variation of movement over time in 2D space corresponding to the ankles, as depicted in Fig. 6. We thus distinguish pedestrians from other objects by analyzing human motion via the variation of these three angles over time, as depicted in Fig. 7.
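The following sketch illustrates the star-skeleton angle computation; approximating the three extremal points as the boundary points farthest from the centroid is a simplifying assumption for illustration.

```python
import cv2
import numpy as np

def star_skeleton_angles(silhouette_mask):
    """Return (alpha, beta, gamma) for one binary silhouette mask."""
    cnts, _ = cv2.findContours(silhouette_mask, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_NONE)
    pts = cnts[0].reshape(-1, 2).astype(float)
    c = pts.mean(axis=0)                        # centroid (xc, yc)
    d = np.linalg.norm(pts - c, axis=1)
    top = pts[np.argsort(d)[-3:]]               # three extremal boundary points
    top = top[np.argsort(top[:, 1])]            # sort by y: head first, legs last
    head, leg1, leg2 = top
    alpha = np.arctan2(head[0] - c[0], c[1] - head[1])      # head vs. vertical
    beta = (np.arctan2(*(leg1 - c)[::-1]) -
            np.arctan2(*(leg2 - c)[::-1]))                  # between the two legs
    gamma = np.arctan2(*(leg2 - leg1)[::-1])                # line between ankles
    return alpha, beta, gamma
```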
Fig. 7 Human motion analysis: (a) star skeleton projected onto the silhouettes over an extracted frame, (b) magnified view of motion analysis
Fig. 8 Attraction force model: (a) particle conversion of all extracted silhouettes, (b) magnified view of particle conversion
Fig. 9 Attraction force model: (a) attraction force between two particles, (b) magnified view of AF for a pedestrian, (c) magnified view of AF for a non-pedestrian
In AFM, we first convert each extracted silhouette into particles such that every silhouette is represented by a collection of particles R = [p_1, p_2, p_3, ..., p_Z], where Z represents the total number of particles in one silhouette, as shown in Fig. 8. From physics, we know that in solids, because of low kinetic energy, the particles cannot overcome the strong force of attraction (bonds) that pulls the particles toward each other. Using this concept, we treat every extracted pixel as a fluid particle and calculate the force of attraction between the particles of each extracted silhouette. To reduce computational complexity, we calculate the internal force of attraction between two mutually interacting particles using Eq. (3), as shown in Fig. 9.
$$F_i = \frac{p_1 p_2}{r^2} \qquad (3)$$
where F_i is the attraction force between p_1 and p_2 of the i-th silhouette, i is in the range [1, E] with E the maximum number of silhouettes per frame, and r² represents the squared distance between p_1 and p_2.
After the attraction force calculation, we discard silhouettes having static attraction force in a sequence of frames by utilizing Eq. (4):
$$H_s = \begin{cases} 1 & \text{if } \dfrac{dF_i}{dt} > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$
where dF_i/dt is the change in the force of attraction over time among the particles of every i-th silhouette between frames t and t + 1. We also eliminate objects whose attraction force is beyond a certain threshold, considering them non-pedestrians, i.e., bicyclists, bikes, etc.
In summary, this section verifies the silhouettes obtained through SS using a fusion of AFM and HMA via a weighted averaging process, eliminating non-humans and non-pedestrians so that pedestrians can be distinguished from other objects and accurately counted and tracked in crowded scenes.
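A minimal sketch of the AFM check of Eqs. (3)-(4): sample particle pairs on a silhouette, compute the pairwise force, and keep silhouettes whose force changes over time. Treating the particle "mass" as pixel intensity and the upper force threshold are assumptions for illustration.

```python
import numpy as np

def attraction_force(particles, intensities):
    """Mean pairwise attraction force for one silhouette's particles."""
    n = len(particles)
    forces = []
    for i in range(n):
        for j in range(i + 1, n):
            r2 = np.sum((particles[i] - particles[j]) ** 2) + 1e-6
            forces.append(intensities[i] * intensities[j] / r2)  # Eq. (3)
    return np.mean(forces)

def is_pedestrian(force_t, force_t1, f_max=1e4):
    dF = force_t1 - force_t
    return dF > 0 and force_t1 < f_max   # Eq. (4) plus the upper threshold
```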
3.4.2 People Counting
After human pedestrian verification, we perform cluster estimation to count the detected human silhouettes using a fuzzy-c-means algorithm. As each silhouette consists of a collection of particles forming a cluster, we first label these clusters in each frame using Eq. (5):
$$L_c = I_m(p_k) \qquad (5)$$
where I_m is the label of cluster m and p_k is the total count of particles in one cluster; L_c is the resulting labeled cluster, which is considered one silhouette and mediated in counting. We also draw bounding boxes around every cluster to make them visually apparent; hence, using labeling and cluster estimation, we count all verified human pedestrians, as depicted in Fig. 10. Note that the number of clusters varies from frame to frame, and the number of particles in each cluster varies from cluster to cluster.
Fig. 10 Pedestrian crowd counting results over different time intervals
Fig. 11 Pedestrian crowd counting results, inferring the number of pedestrians in each cluster, on (a) Mall dataset and (b) UCSD dataset
Inferring the Number of People in Each Cluster: Optimally, the clustering process above depicts each pedestrian in the scene with one distinct cluster, so the count of pedestrians equals the count of clusters. In reality, this is not always the case: because of occlusion, pedestrians close to each other can be clustered together, so the cluster count itself can be misleading. We therefore propose a particle-based measure, since each extracted silhouette has already been converted into particles. In practice, a single pedestrian usually consists of a specific number of particles. We first measure the minimum number of particles required for a single pedestrian and then use Eq. (6) to infer the number of pedestrians in each cluster:
$$H_k = \frac{p_k}{\bar{p}_{nk}} \qquad (6)$$
where H_k represents the total number of pedestrians in cluster k, p_k is the total number of particles in cluster k, and \bar{p}_{nk} is the average number of particles required for a single pedestrian. Figure 11 shows the pedestrian counts under occlusion, with two humans present in one cluster.
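A compact sketch of the counting step: a small fuzzy-c-means implementation over particle coordinates, with Eq. (6) applied per cluster. The cluster number c and the average particles-per-pedestrian value are assumptions, not calibrated values.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100):
    """Tiny FCM: X is N x 2 particle coordinates; returns memberships U (c x N)."""
    N = len(X)
    U = np.random.dirichlet(np.ones(c), size=N).T       # random fuzzy partition
    for _ in range(iters):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-9
        U = 1.0 / (d ** (2 / (m - 1)))                  # standard FCM update
        U /= U.sum(axis=0, keepdims=True)
    return U

def count_pedestrians(particles, c, avg_particles_per_person=120):
    U = fuzzy_c_means(particles, c)
    labels = U.argmax(axis=0)
    total = 0
    for k in range(c):
        p_k = np.sum(labels == k)                       # particles in cluster k
        total += max(1, round(p_k / avg_particles_per_person))  # Eq. (6)
    return total
```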
3.4.3 Pedestrian Association
After pedestrian counting, our next goal is to track these pedestrians. For this purpose, we use the Hungarian Algorithm (HA) to associate people from one frame to another based on a score [39, 40], maintaining the identity of each pedestrian in the scene. In our work, we define the score as a combination of two matrices: the bounding-box center distance and the IoU of the bounding boxes. The association algorithm consists of the following steps:
1. Establish two empty lists: a tracking list (t−1) and a detection list (t).
2. Calculate the center distance and IoU score and store them in a matrix by going through the detection and tracking lists; a cost function is used to prioritize each score.
3. Run the Hungarian algorithm, which searches for the minimum tracking value for each detection in the matrix using bipartite-graph logic, yielding a matrix that depicts the matching between detections and trackers.
4. In cases of complex occlusion where bounding boxes overlap, giving two or more matches for a single candidate, set the maximum IoU to 1 and all remaining entries to 0. Also, since we have a score rather than a cost, we replace each 1 with −1 so that the minimum-value search applies.
5. Missing values in the Hungarian matrix are unmatched detections and unmatched trackers.
6. Unmatched trackers live for 5 more frames awaiting re-association; otherwise they are removed. For unmatched detections, we initialize a new tracker for 3 frames; if it persists, it becomes active, otherwise it is removed.
The result of pedestrian association is a set of trackers, each associated with a detection, which becomes the input for the tracking stage.
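For illustration, the association step can be sketched with SciPy's Hungarian solver; the relative weighting of the IoU and center-distance scores is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / area if area > 0 else 0.0

def associate(tracks, detections, w_iou=0.7, w_dist=0.3):
    cost = np.zeros((len(tracks), len(detections)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(detections):
            ct = ((t[0]+t[2])/2, (t[1]+t[3])/2)
            cd = ((d[0]+d[2])/2, (d[1]+d[3])/2)
            dist = np.hypot(ct[0]-cd[0], ct[1]-cd[1])
            # score -> cost: negate the IoU so the solver minimizes
            cost[i, j] = -w_iou * iou(t, d) + w_dist * dist / 100.0
    rows, cols = linear_sum_assignment(cost)   # optimal track/detection pairs
    return list(zip(rows, cols))
```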
3.4.4 Pedestrian Tracking
The goal of our pedestrian tracking model is to establish the trajectories of all pedestrians in a scene through association and position matching in each frame of the video using a dynamic template matching approach. The problem is to determine whether, and where, there is an occurrence (or at least a sufficiently similar occurrence) of the template in the target image. For this purpose, we use a correlation coefficient as a measure of similarity between the reference (template) and each location (x, y) in the target image; the result is maximal at locations where the template corresponds pixel by pixel to the sub-image located at (x, y).
Our tracking module gets position information from the detection module and associates pedestrians using HA. This knowledge is then utilized to extract a reference image from the previous frame whenever required, and the module flashes the same rectangle over the pedestrian if it is found while searching the frames captured from that point onward; otherwise, a new template is generated. Since the templates for searching are generated dynamically, this constitutes a dynamic template matching algorithm. Hence, using data association, predicted and detected pedestrians are associated and tracked using normalized cross-correlation as a cost function and dynamic template matching. In our tracking approach, we represent each pedestrian as a rectangle with its corresponding ID at the bottom right, as depicted in Fig. 12.
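A minimal sketch of one dynamic template matching step using OpenCV's normalized cross-correlation; the re-matching threshold is an assumed value.

```python
import cv2

def track_step(frame_gray, template, threshold=0.6):
    """Find the template in the frame; return (box, score) or (None, score)."""
    res = cv2.matchTemplate(frame_gray, template, cv2.TM_CCOEFF_NORMED)
    _, score, _, top_left = cv2.minMaxLoc(res)  # best correlation and location
    if score < threshold:
        return None, score       # lost: the caller cuts a fresh template
    h, w = template.shape
    x, y = top_left
    return (x, y, x + w, y + h), score
```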
Fig. 12 Sample frames of crowd tracking results at different time intervals over UCSD dataset
3.5 Anomaly Detection
Categorizing interactions within crowds as normal or abnormal requires the extraction of robust spatio-temporal descriptors along with a strong classifier. Hence, for accurate anomaly detection, after applying semantic segmentation (mentioned in Sect. 3.3), the extracted silhouettes are passed through the descriptor extraction step, and the descriptors are optimized using AGA. Finally, the decision is made by the multilayer neuro-fuzzy classifier.
3.5.1 Descriptor Extraction
In this paper, we introduce multiple distinguishable hybrid spatio-temporal descriptors, including crowd shape deformation, silhouette slicing, particle convection, energy, and dominant motion descriptors.
In the crowd shape deformation descriptor, we analyze the deformation of the crowd shape over time. We first extract the crowd contour by computing the centers of all pedestrians present in the scene and connecting the centers of those pedestrians at the corners of the frame, forming the largest convex hull, which represents the crowd contour, as depicted in Fig. 13. After crowd contour extraction, we compare the contours of two consecutive frames using the normalized moment of the contour to compute the changes in crowd shape. Hence, we integrate over all points of the contour and compute the central moment, as expressed in Eq. (7):
$$C_{r,s} = \sum_x^n \sum_y^n A(x,y)\,(x - x_{avg})^r (y - y_{avg})^s \qquad (7)$$
Fig. 13 Crowd shape deformation descriptors: (a) normal frame, (b) abnormal frame
where r and s represent the powers to which x and y are taken in the sum, while A(x, y) denotes the intensity of the pixel at coordinate (x, y). The summation is over all points on the contour boundary. The computed moment is like a central moment except that x_avg and y_avg are average values. For the simple (r, s) moment, if r and s are both 0, then C_{0,0} is just the length of the contour in points. To compute the normalized moment, we divide by an appropriate power of C_{0,0}, as expressed in Eq. (8):
$$N_{r,s} = \frac{C_{r,s}}{C_{0,0}^{(r+s)/2+1}} \qquad (8)$$
where N_{r,s} represents the normalized moment. This normalized moment of the contour is used to compare the contours of consecutive frames to analyze variations in crowd shape.
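A sketch of this descriptor under the simplifying assumption of unit pixel intensity along the contour; the moment orders compared are illustrative choices.

```python
import cv2
import numpy as np

def crowd_contour(centers):
    """Convex hull over all pedestrian centers, the crowd contour."""
    pts = np.asarray(centers, dtype=np.float32)
    return cv2.convexHull(pts).reshape(-1, 2)

def normalized_moment(contour, r, s):
    x, y = contour[:, 0], contour[:, 1]
    xa, ya = x.mean(), y.mean()
    c_rs = np.sum((x - xa) ** r * (y - ya) ** s)   # Eq. (7), unit intensity
    c_00 = len(contour)                            # C00: contour length in points
    return c_rs / c_00 ** ((r + s) / 2 + 1)        # Eq. (8)

def shape_change(cnt_t, cnt_t1, orders=((2, 0), (0, 2), (1, 1))):
    """Compare two consecutive crowd contours via normalized moments."""
    return sum(abs(normalized_moment(cnt_t, r, s) -
                   normalized_moment(cnt_t1, r, s)) for r, s in orders)
```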
We also introduce new robust silhouette slicing descriptors in the time domain. First, we create patches (slices) of all pedestrian silhouettes present in the scene by finding their centers and joining each center to the two vertical extreme points with a line segment; patches are then created such that their centers lie on the line segment pertinent to the body of each pedestrian, so each pedestrian consists of a set of slices/patches, as shown in Fig. 14a.
Fig. 14 Silhouette slicing descriptors: (a) silhouette slicing, (b) and (c) selected histograms of random slices, (d) unselected histogram of a random slice
Fig. 15 Particles convection descriptors: (a) PCD for all human silhouettes, (b) magnified view of PCD for two silhouettes
The total number of slices per pedestrian is a user choice; in our experimentation we fixed the count to six, as it shows good results. After slicing, we compute the motion changes over time by first labeling each pedestrian and automatically choosing the four best slices. For slice selection, we compute the histogram of each slice by converting RGB to HSI; the four slices of each pedestrian whose histograms are most equalized compared to the others are selected automatically. The selected slices are then matched with the corresponding candidate slices of the same labeled human in the next frame to compute the motion changes. For slice matching, we take only the I channel of the HSI histogram and use the Hellinger distance between the candidate histogram and the model histogram, as expressed in Eq. (9):
$$HD\left(h_i^m, h_i^c\right) = \sqrt{\frac{1}{2}\sum_{i=1}^{k}\left(\sqrt{h_i^m} - \sqrt{h_i^c}\right)^2} \qquad (9)$$
where h_i^m and h_i^c represent the probability distributions of the model and candidate histograms, respectively. The model slice in one frame is matched to the candidate slice in the subsequent frame that presents the smallest Hellinger distance.
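A minimal sketch of the slice-matching step with the Hellinger distance of Eq. (9), assuming the histograms are already normalized to probability distributions.

```python
import numpy as np

def hellinger(h_model, h_cand):
    """Hellinger distance between two normalized histograms, Eq. (9)."""
    return np.sqrt(0.5 * np.sum((np.sqrt(h_model) - np.sqrt(h_cand)) ** 2))

def match_slice(model_hist, candidate_hists):
    """Return the index of the candidate slice with the smallest distance."""
    d = [hellinger(model_hist, h) for h in candidate_hists]
    return int(np.argmin(d))
```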
In the particle convection descriptor, we convert each silhouette present in the scene into particles such that every silhouette is represented by a collection of particles, and we estimate the interaction force between particles. In the computation of the interaction force, we consider only those particles that lie on the silhouette contour, as shown in Fig. 15. Generally, in crowded scenes, pedestrians have certain goals and destinations and thus a desired velocity, calculated using bilinear interpolation of the neighboring flow-field vectors via Eq. (10):
$$M_j^c = (1 - w_j)\,V(x_j, y_j) + w_j\,V_{mean}(x_j, y_j) \qquad (10)$$
where M_j^c is the desired velocity, V(x_j, y_j) represents the optical flow of particle j, and V_{mean}(x_j, y_j) is the mean optical flow for particle j at coordinate (x_j, y_j); w_j is the panic weight parameter, with w_j = 0 indicating the individual motion of pedestrian j and w_j = 1 indicating group motion of pedestrian j. In reality, due to the presence of other pedestrians and obstacles in a crowd, the actual motion of a pedestrian differs from the desired motion and is calculated as in Eq. (11):
$$M_j = V_{mean}(x_j, y_j) \qquad (11)$$
where M_j is the actual motion of pedestrian j. The interaction force of a particle with the environment and its neighboring particles can be calculated from the difference between the actual velocity of the particle and its desired velocity. Hence, utilizing the actual and desired motion calculated above, we compute the interaction force using Eq. (12):
$$I_F = \frac{1}{\tau}\left(M_j^c - M_j\right) - \frac{dM_j}{dt} \qquad (12)$$
where I_F represents the resultant interaction force, τ is the relaxation parameter, and dM_j/dt is the change in the motion of pedestrians over time. Using this interaction force, we observe the pedestrians' motion dynamics.
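A sketch of Eqs. (10)-(12); the panic weight w and relaxation parameter τ are assumed constants here, and the velocities are taken as per-particle 2D arrays.

```python
import numpy as np

def interaction_force(v, v_mean, v_prev, w=0.5, tau=0.5, dt=1.0):
    """v, v_mean, v_prev: per-particle 2D velocity arrays (N x 2)."""
    desired = (1 - w) * v + w * v_mean         # Eq. (10): desired velocity
    actual = v_mean                            # Eq. (11): actual motion
    dM_dt = (actual - v_prev) / dt             # temporal change in motion
    return (desired - actual) / tau - dM_dt    # Eq. (12): interaction force
```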
In the dominant motion descriptor [41, 42], we extract trajectories using a set of feature point tracks expressed as
$$\left\{\left(x_d^j, y_d^j\right),\; d = D_{strt}^j, \ldots, D_{fnl}^j\right\}, \quad j = 1, \ldots, N \qquad (13)$$
where N is the total count of point tracks. We use the Lucas-Kanade tracker to track these feature points. Our purpose is to cluster these point tracks into dominant patterns of motion. Some point tracks cover only a small portion of each pedestrian's motion; for dominant motion trajectories, we use point tracks with protracted trajectories through the scene along which a large count of feature point tracks exists. We cluster point tracks that have the same direction of motion and are spatially close to each other. For this purpose, we use distance metrics to compare point tracks based on the longest common subsequence. We sample new points after every eighth frame and add them to the tracker, since new pedestrians enter the videos over time. A new cluster is established if a track is found that is adequately long and distinct from every existing cluster center. We update a cluster's center using a least-squares polynomial fit if its size exceeds a fixed value. Finally, to attain the dominant motion path, we merge clusters if the similarity of their centers exceeds 50%. Steps for the dominant motion descriptor are given in Algorithm 1.
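Since Algorithm 1 is only summarized above, the following sketch illustrates the point-track extension with Lucas-Kanade and a greedy direction/proximity clustering; the thresholds are illustrative assumptions, and the longest-common-subsequence metric is replaced by a simpler endpoint comparison for brevity.

```python
import cv2
import numpy as np

lk = dict(winSize=(15, 15), maxLevel=2)

def extend_tracks(prev_gray, gray, tracks):
    """tracks: list of point lists; append each point's new LK position."""
    pts = np.float32([t[-1] for t in tracks]).reshape(-1, 1, 2)
    nxt, ok, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None, **lk)
    for t, p, good in zip(tracks, nxt.reshape(-1, 2), ok.ravel()):
        if good:
            t.append(tuple(p))
    return tracks

def cluster_tracks(tracks, dir_thresh=0.5, dist_thresh=30.0):
    clusters = []
    for t in tracks:
        d = np.subtract(t[-1], t[0])           # overall direction of the track
        d = d / (np.linalg.norm(d) + 1e-9)
        for c in clusters:
            if (np.dot(d, c["dir"]) > dir_thresh and
                    np.hypot(*np.subtract(t[-1], c["end"])) < dist_thresh):
                c["members"].append(t)         # same direction, spatially close
                break
        else:
            clusters.append({"dir": d, "end": t[-1], "members": [t]})
    return clusters
```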
In the energy descriptor, we observe changes in the energy-level distribution of all silhouettes present in the scene. For this, we store the movements of human body parts as energy maps in an energy-based matrix with values between 0 and 8000. After distributing the energy values, we extract the energy index values into a 1D array. The mathematical representation of the energy distribution is expressed in Eq. (14):
$$E_i = \sum_{v=0}^{n} Id_R(v) \qquad (14)$$
where E_i represents the 1D array, v is the index number, and Id_R denotes the RGB values at v. The energy distribution of some random frames of the UMN dataset is shown in Fig. 16. We analyze the variation of the distribution temporally and use a heuristic thresholding technique to identify scenes where the energy index values exceed the threshold (Fig. 17).
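A sketch of the energy descriptor; the decay factor and the abnormality threshold are assumptions, with the map capped at the 0-8000 value range mentioned above.

```python
import numpy as np

class EnergyMap:
    def __init__(self, shape, decay=0.9, cap=8000):
        self.E = np.zeros(shape, dtype=np.float32)
        self.decay, self.cap = decay, cap

    def update(self, silhouette_mask):
        """Accumulate silhouette motion into the map; return the 1D array E_i."""
        self.E = np.clip(self.decay * self.E + silhouette_mask * 255.0,
                         0, self.cap)          # values kept in [0, 8000]
        return self.E.ravel()

def is_abnormal(energy_1d, threshold=2.5e6):
    return energy_1d.sum() > threshold         # heuristic thresholding
```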
Fig. 16 Energy descriptors: (a) normal frame, (b) abnormal frame
Fig. 17 AGA optimization: (a) normal optimal descriptors, (b) abnormal optimal descriptors
3.5.2 Event Optimization: Adaptive Genetic Algorithm
After descriptor extraction, the extracted descriptors are optimized using an Adaptive Genetic Algorithm (AGA) [43, 44]. Descriptor vectors are mapped to their respective genes to convert them into equivalent chromosomes, such that each gene represents one descriptor in a chromosome. The set of these chromosomes is called the population. At first, we randomly create the population of chromosomes. The chromosomes of this population are then evaluated through a fitness function, which indicates how effective a chromosome is for an optimal solution. The chromosomes with maximum fitness value are selected, and genetic operations (crossover and mutation) are applied to the selected chromosomes to form new ones. Crossover is performed only for above-average fitness; the adaptive values are computed as expressed in Eq. (15):
$$O = \begin{cases} w_1 (H_{max} - H)/(H_{max} - H_{avg}), & H > H_{avg} \\ w_3, & H \le H_{avg} \end{cases} \qquad (15)$$
where H_max and H_avg are the maximum and average fitness of chromosomes in the active generation, computed as
$$H_{avg} = \frac{\sum_{j=1}^{h} H_j}{h}, \qquad H_{max} = \max\{H_1, H_2, H_3, \ldots, H_f\}$$
For mutation, we set the probability low and use Eq. (16):
$$M = \begin{cases} w_2 (H_{max} - H)/(H_{max} - H_{avg}), & H > H_{avg} \\ w_4, & H \le H_{avg} \end{cases} \qquad (16)$$
where w_1, w_2, w_3, and w_4 are weight parameters. If the current fitness value is higher than the previous one, we replace the previous one with the new value. The process finishes only when the stopping criteria are met. The purpose is to evolve these chromosomes by generating better ones until optimal descriptors are obtained. The working mechanism of AGA is depicted in Algorithm 2.
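A sketch of the adaptive operator rates of Eqs. (15)-(16): fitter-than-average chromosomes get lower crossover and mutation rates. The weight values are assumptions, and the fitness function depends on the downstream classifier.

```python
import numpy as np

def adaptive_rates(H, fitness, w1=0.9, w2=0.1, w3=0.9, w4=0.1):
    """H: fitness of one chromosome; fitness: population fitness array.
    Returns (crossover probability, mutation probability)."""
    h_max, h_avg = fitness.max(), fitness.mean()
    if H > h_avg:
        scale = (h_max - H) / (h_max - h_avg + 1e-9)
        return w1 * scale, w2 * scale   # Eq. (15) and Eq. (16), H > H_avg
    return w3, w4                       # below-average case
```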
3.5.3 Classifier: Multilayer Neuro-Fuzzy
The multi-fused optimized descriptors from AGA are fed to a multilayer Neuro-Fuzzy Classifier (NFC) for decision making. In the NFC, we fuse the learning ability of neural networks with the knowledge representation of fuzzy logic. This fusion makes the classifier less dependent on expert knowledge and more systematic. Our NFC consists of five layers: the input layer, fuzzification layer, fuzzy rules layer, output membership functions layer, and defuzzification layer. The role of each layer is as follows.
Layer 1 (input layer): the extracted optimized descriptors are fed to the multilayer NFC, which has two inputs x and y, where x represents the spatial descriptors (A1, A2) and y represents the temporal descriptors (B1, B2, B3).
Layer 2 (fuzzification layer): the crisp descriptor values are transformed into fuzzy values using membership functions, as depicted in Eq. (17):
$$O_i^2 = \mu_{f_i}(x), \quad i = 1, 2, 3 \qquad (17)$$
where O_i^2 represents the output of layer 2, and μ_{f_i}(x) can be any fuzzy membership function. In our work, we use a combination of three membership functions, namely R, triangular, and L, to create the intersection points for the respective NFC inputs; the regions under the membership functions are called fuzzy regions. Membership values that fall purely under the R fuzzy region are considered normal, and those that fall purely under the L region are predicted as abnormal.
Layer 3 (fuzzy rule layer): fuzzy rules are defined to evaluate the fuzzy values. A sample fuzzy rule is shown in Eq. (18):
$$\text{Rule } N: \text{IF } O_i^2 \text{ is } t_i^N, \text{ THEN } Z_N = w_i^3 \qquad (18)$$
where t_i^N is the threshold value and w_i^3 is the weight assigned by layer 3. In our experimentation, we used seven if-then fuzzy rules, selected because of their higher accuracy rate. Figure 18 shows the accuracy of our system with different numbers of rules.
Fig. 18 Accuracy of multilayer NFC with different numbers of fuzzy rules
Layer 4 (output membership function layer): provides the output membership values, which are then normalized for the next layer.
Layer 5 (defuzzification layer): defuzzification is done by performing the MAX operation. Figure 19 depicts the architecture of the NFC.
Fig. 19 Architecture of the proposed multilayer neuro-fuzzy classifier for anomaly detection
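A toy forward pass through the five layers described above; the membership-function breakpoints, rule weights, and rule wiring are placeholders for illustration, not the trained values.

```python
import numpy as np

def mf_L(x, a, b):  return np.clip((b - x) / (b - a), 0, 1)  # L-shaped
def mf_R(x, a, b):  return np.clip((x - a) / (b - a), 0, 1)  # R-shaped
def mf_tri(x, a, b, c):
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0)

def nfc_forward(x, rule_weights):
    # Layers 1-2: fuzzify one crisp descriptor value into three fuzzy values
    mu = np.array([mf_R(x, 0.0, 0.5), mf_tri(x, 0.2, 0.5, 0.8),
                   mf_L(x, 0.5, 1.0)])
    # Layer 3: each rule fires in proportion to its membership and weight
    firing = np.repeat(mu, len(rule_weights) // 3 + 1)[:len(rule_weights)]
    z = firing * rule_weights
    # Layer 4: normalize the output membership values
    z = z / (z.sum() + 1e-9)
    # Layer 5: MAX defuzzification -> normal (0) vs. abnormal (1)
    return int(np.argmax([z[:len(z)//2].max(), z[len(z)//2:].max()]))

decision = nfc_forward(0.73, rule_weights=np.ones(7))  # 7 if-then rules
```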
4 Experimental Setup and Results
This section elaborates the details of all experiments performed to validate the proposed model. All processing and experimentation were performed with MATLAB and Google Colab (Python) tools. The hardware was an Intel Core i5-6200U with 2.40 GHz processing power, 8 GB RAM, and a 2 GB dedicated Nvidia 920M graphics card, running x64-based Windows 10 Pro. We segregate the experiments into three parts. In the first part, we evaluate the performance of crowd counting and tracking using the UCSD and
Mall datasets. In the second part, we evaluate the performance of anomaly detection on the UMN and MED datasets. Lastly, in the third part, we compare our proposed model with other state-of-the-art methods. This section is further split into three subsections: dataset description, performance metrics with results, and discussion.
4.1 Datasets Description
For crowd tracking, we used two datasets, UCSD and Mall, while the UMN crowd and MED datasets are used for anomaly detection. Details of each dataset are given in the following subsections.
4.1.1 UCSD Pedestrian Dataset
The UCSD dataset [45] is a 2000-frame video dataset containing videos of pedestrians, captured from a stationary camera on UCSD walkways. The average crowd count for the UCSD dataset is 24.8 per frame. The videos were recorded at 10 frames per second with a resolution of 328 × 158.
4.1.2 Mall Pedestrian Dataset
The Mall pedestrian dataset [46] consists of 2000 frames of pedestrians, captured inside a shopping mall using a publicly accessible surveillance camera. The average crowd count for the Mall dataset is 31.2 per frame. The videos were recorded
Table 3 Measurements of MAE and MSE for pedestrian counting over UCSD pedestrian and Mall datasets

Model          | UCSD MAE | UCSD MSE | Mall MAE | Mall MSE
Proposed model | 1.69     | 2.09     | 2.57     | 4.34
Fig. 20 Confusion matrix for semantic segmentation accuracy over UCSD dataset (mean accuracy = 96.69%). Class abbreviations: car, sdk = sidewalk, hmn = human, bcy = bicycle, grass, sky, tree, sktb = skateboard, road
Fig. 21 Confusion matrix for semantic segmentation accuracy over UMN dataset (mean accuracy = 96.03%). Class abbreviations: car, sdk = sidewalk, hmn = human, bcy = bicycle, grass, sky, tree, sktb = skateboard, road
at fewer than 2 frames per second (fps), with a resolution of 640 × 480.
4.1.3 UMN Dataset
The UMN dataset [47] was captured at the University of Minnesota. The dataset contains 11 videos with three different scenes: one indoor and two outdoor. The indoor scene has six scenarios with a total of 4144 frames. The two outdoor scenes are the plaza scene, with three scenarios and 2142 frames, and the lawn scene, with two scenarios and 1453 frames. Each video in the UMN dataset starts with a normal scene and ends with sequences of abnormal crowd behavior.
4.1.4 MED Dataset
The MED dataset [48] consists of videos recorded using an immobile video camera elevated at height, with a resolution of 554 × 235. Each video in MED begins with a normal scene and terminates with abnormal ones, with crowd densities changing from sparse to very crowded.
4.2 Performance Metrics and Results
We used seven evaluation metrics to measure the performance of our proposed model. For evaluating the performance of crowd counting, we used two universal quantitative metrics, Mean Absolute Error (MAE) and Mean Square Error (MSE):
$$MAE = \frac{1}{T}\sum_{v=1}^{T} |G_v - H_v| \qquad (19)$$
$$MSE = \frac{1}{T}\sum_{v=1}^{T} (G_v - H_v)^2 \qquad (20)$$
where T is the total number of testing frames, while G_v and H_v are the predicted and ground-truth counts of pedestrians, respectively. The performance of pedestrian tracking and crowd anomaly detection was evaluated with five evaluation metrics: accuracy, precision, recall, F1 score, and the confusion matrix [49, 50].
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (21)$$
$$Precision = \frac{TP}{TP + FP} \qquad (22)$$
$$Recall = \frac{TP}{TP + FN} \qquad (23)$$
Table 4 Measurements of accuracy, recall, and F1 score for crowd tracking over UCSD pedestrian dataset

Sequence no. (frame 70) | Ground truth | TP | FN | FP | Accuracy | Recall | F1 score
4  | 9  | 9  | 0 | 0 | 1.000 | 1.000 | 1.000
9  | 14 | 13 | 1 | 0 | 0.928 | 0.928 | 0.962
16 | 17 | 15 | 2 | 0 | 0.882 | 0.882 | 0.937
24 | 21 | 19 | 2 | 0 | 0.904 | 0.904 | 0.949
30 | 25 | 22 | 3 | 0 | 0.880 | 0.880 | 0.936
Mean accuracy: 91.8%
Table 5 Measurements of accuracy, recall, and F1 score for crowd tracking over Mall pedestrian dataset

Sequence no. (frame 70) | Ground truth | TP | FN | FP | Accuracy | Recall | F1 score
5  | 11 | 10 | 1 | 0 | 0.909 | 0.909 | 0.952
11 | 15 | 14 | 1 | 0 | 0.933 | 0.933 | 0.965
17 | 21 | 19 | 2 | 0 | 0.904 | 0.904 | 0.949
22 | 26 | 22 | 4 | 0 | 0.846 | 0.846 | 0.916
30 | 30 | 26 | 4 | 0 | 0.866 | 0.866 | 0.928
Mean accuracy: 89.16%
Fig. 22 Confusion matrix with mean accuracy for crowd anomaly detection on UMN dataset (mean accuracy of event detection = 93.5%)
Table 6 Measurements of precision, recall, and F1 score for crowd anomaly detection over UMN dataset

Events   | Precision | Recall | F1 score
Normal   | 0.951     | 0.98   | 0.964
Abnormal | 0.979     | 0.95   | 0.963
Average  | 0.965     | 0.965  | 0.963
$$F_1\,score = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (24)$$
where TP represents true positives, TN true negatives, FP false positives, and FN false negatives. The percentage of the total count of accurate classifications is referred to as accuracy; precision refers to the closeness of the measurements to each other; the percentage of real positives classified as anomalous is referred to as recall; and the F1 score is a measure of test accuracy. First, we evaluate the performance accuracy of semantic segmentation for multi-object detection; Figs. 20 and 21 show the confusion matrices for segmentation accuracy over the UCSD and UMN datasets.
Fig. 23 Confusion matrix with mean accuracy for crowd anomaly detection on MED dataset (mean accuracy of event detection = 92.5%)
Table 7 Measurements of precision, recall, and F1 score for crowd anomaly detection over MED dataset

Events   | Precision | Recall | F1 score
Normal   | 0.923     | 0.96   | 0.941
Abnormal | 0.958     | 0.92   | 0.938
Average  | 0.940     | 0.94   | 0.939
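For reference, the metrics of Eqs. (19)-(24) amount to only a few lines of code over per-frame counts and confusion-matrix tallies.

```python
import numpy as np

def mae_mse(pred_counts, gt_counts):
    e = np.asarray(pred_counts, float) - np.asarray(gt_counts, float)
    return np.mean(np.abs(e)), np.mean(e ** 2)   # Eq. (19), Eq. (20)

def classification_metrics(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + tn + fp + fn)        # Eq. (21)
    prec = tp / (tp + fp)                        # Eq. (22)
    rec = tp / (tp + fn)                         # Eq. (23)
    f1 = 2 * prec * rec / (prec + rec)           # Eq. (24)
    return acc, prec, rec, f1
```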
Table 8 Comparative analysis for crowd counting with other state-of-the-art methods in terms of mean absolute error (MAE) and mean square error (MSE) over UCSD and Mall datasets

Methods                       | UCSD MAE | UCSD MSE | Mall MAE | Mall MSE
MSHD [51]                     | 2.10     | 6.05     | 2.90     | 14.04
Geometric head detection [52] | 2.05     | 4.93     | 4.09     | 14.9
ST-CNN [53]                   | 4.03     | 5.87     | -        | -
DigCrowd [54]                 | -        | -        | 3.21     | 16.4
AMS-CNN (FV-3) [55]           | 2.46     | 3.32     | 2.94     | 3.63
FSAC-Meta learning [56]       | 3.08     | 4.16     | 2.44     | 3.12
DIR-CDCC [57]                 | 1.79     | 2.47     | 2.36     | 3.12
Proposed model                | 1.69     | 2.09     | 2.57     | 4.34
Table 9 Accuracy comparison of the proposed approach to state-of-the-art crowd tracking methods over UCSD dataset

Models                   | Average accuracy (%)
DDPMO [58]               | 81.3
MHT [59]                 | 65.3
DCEM [60]                | 77.9
TBC [61]                 | 82.1
PGM-based IEC model [62] | 86.9
Proposed model           | 91.8
4.2.1 Experiment 1: Crowd Counting and Tracking Over
UCSD and Mall Datasets
We evaluate the performance of our proposed crowd counting and tracking system on two publicly available benchmark datasets, UCSD and Mall. For evaluating the classification accuracy of the NFC, the experiments were repeated three times on the testing sets of each dataset independently. Table 3 shows the MAE and MSE of our proposed pedestrian crowd counting system over the UCSD and Mall datasets (Figs. 20, 21).
Tables 4 and 5 present the mean accuracy along with recall and F1 score of the crowd tracking system over the UCSD and Mall datasets for 30 sequences, where one sequence consists of 70 frames.
4.2.2 Experiment 2: Crowd Anomaly Detection Over UMN
and MED Datasets
In this experiment, we evaluate the performance of our crowd anomaly detection system using the confusion matrix, precision, recall, and F1 score over the UMN and MED benchmark datasets. Figure 22 depicts the confusion matrix and Table 6 shows the performance measurements for crowd anomaly detection over the UMN dataset for the first 30 sequences. Figure 23 shows the confusion matrix and Table 7 presents the performance measures for anomaly detection over the Motion Estimation Dataset (MED).
4.2.3 Experiment 3: Crowd Counting, Tracking and Anomaly Detection Comparison with State-of-the-art Methods
In this experiment, we compare our proposed system with other well-known methods. Table 8 shows a comparison of our proposed crowd counting system with other state-of-the-art methods.
In Table 9, a comparison between our crowd tracking system and other state-of-the-art systems shows that our system achieves a higher accuracy rate than existing crowd tracking methods.
Table 10 presents a comparison of our proposed crowd anomaly detection system with other state-of-the-art systems on the UMN dataset, with two outdoor scenes (scenes 1 and 3) and one indoor scene (scene 2). As depicted, our system achieved
Table 10 Accuracy comparison of the proposed approach to state-of-the-art crowd anomaly detection methods over UMN dataset

Methods                   | Scene 1 | Scene 2 | Scene 3 | Overall accuracy (%)
Cell Structure [63]       | -       | -       | -       | 88.3
LECM-2 [64]               | 98.5    | 87.9    | 94.6    | 93.01
CLMR (modified-GT) [65]   | 98.7    | 93.7    | 97.1    | 96.5
2STG-AKSVD2 [66]          | 89.80   | 76.42   | 89.50   | 85.20
MU-LSTM [67]              | 97      | 94      | 93      | 94.6
Tracklet Clustering [68]  | 97      | 89      | 98      | 94.3
MMAV [69]                 | -       | -       | -       | 90.0
PGM-based IEC Model [62]  | 87.43   | 83.21   | 90.63   | 86.06
Proposed model            | 99.23   | 95.56   | 97.89   | 96.59
a higher accuracy rate as compared to the other well-known
existing methods.
5 Conclusion
In this article, we introduced the idea of semantic segmentation for foreground extraction. We used clustering techniques and dynamic template matching for crowd counting and tracking. For anomaly detection, new robust spatio-temporal descriptors are extracted, optimized using AGA, and passed to a multilayer NFC. Through detailed experimentation, we demonstrated the capability of our proposed system in crowded environments. The accuracy of our tracking system degrades marginally for dense crowds, mainly because of the full occlusions that occur in the test videos. We evaluated the performance of our proposed model on four publicly available benchmark datasets and achieved superior accuracy rates compared to other existing state-of-the-art systems. The proposed system can be deployed to great benefit in various public places, such as political rallies, public celebrations, airports, train stations, and shopping malls, to control, protect, and supervise crowds.
In the future, we aim to work on more complex crowd environments and address the occlusion problem by introducing new occlusion reasoning methods. Furthermore, we plan to extend our work to the recognition of different scenes such as riots or chaotic acts, fights, sports, robbery, and road accidents.
Author Contributions Conceptualization, F.A. and A.J.; methodology, F.A. and A.J.; software, F.A.; validation, F.A. and A.J.; formal analysis, F.A. and A.J.; resources, A.J.; writing, review and editing, F.A. and A.J. All authors have read and agreed to the published version of the manuscript.
Data Availability Data sharing is not applicable.
Declarations
Conflict of interest The authors declare no conflict of interest.
References
1. Mo, H.; Wu, W.: Background noise filtering and distribution divid-
ing for crowd counting. IEEE Trans. Image Process. 29, 8199–8212
(2020)
2. Rezaei, K.; Mohini, M.K.: A survey on deep learning-based
real-time crowd anomaly detection for secure distributed video
surveillance. In: Personal and Ubiquitous Computing, pp. 1–17.
Springer (2021)
3. Balasundaram, A.; Chellappan, C.: An intelligent video analytics model for abnormal event detection in online surveillance video. J. Real-Time Image Process. 17, 915–930 (2020)
4. Jalal, A.; Batool, M.: Sustainable wearable system: human behav-
ior modeling for life-logging activities using K-ary tree hashing
classifier. Sustainability 12, 10324 (2020)
5. Wang, Q.; Yuan, Y.: Pixel-wise crowd understanding via synthetic
data. Int. J. Comput. Vis. 129, 225–245 (2021)
6. Nida, K.; Yazeed, G.; Jalal, A.: Semantic recognition of human-
object interactions via Gaussian-based elliptical modeling and
pixel-level labeling. IEEE Access 6, 66 (2021)
7. Tripathi, G.; Vishwakarma, D.K.: Convolutional neural networks
for crowd behavior analysis: a survey. Vis. Comput. 35, 753–776
(2019)
8. Gochoo, M.; Jalal, A.: Monitoring real-time personal locomotion
behaviors over smart indoor-outdoor environments via body-worn
sensors. IEEE Access 6, 66 (2021)
9. Lentzas, A.; Vrakas, D.: Non-intrusive human activity recognition and abnormal behavior detection on elderly people. Artif. Intell. Rev. 53, 1975–2021 (2020)
10. Nadeem, A.; Jalal, A.: Human actions tracking and recognition
based on body parts detection via artificial neural network. In: Pro-
ceedings of the ICACS, pp. 1–6. IEEE (2020)
11. Akhter, I.; Jalal, A.; Kim, K.: Adaptive pose estimation for gait
event detection using context-aware model and hierarchical opti-
mization. Proc. EE&T 16, 2721–2729 (2021)
12. Grant, J.M.; Flynn, P.J.: Crowd scene understanding from video: a
survey. ACM Trans. Multimedia Comput. Commun. Appl. TOMM
13(2), 1–23 (2017)
13. Gochoo, M.; Jalal, A.: Stochastic remote sensing event classifi-
cation over adaptive posture estimation via deep belief network.
Remote Sens. 13, 912 (2021)
14. Al-Shaery, A.M.; Khozium, M.O.: In-depth survey to detect, mon-
itor and manage crowd. IEEE Access 8, 209008–209019 (2020)
15. Mahmood, M.; Jalal, A: Robust spatio-temporal features for human
interaction recognition via artificial neural network. In: Proceed-
ings of the FIT, pp. 218–223. IEEE (2018)
16. Nadeem, A.; Jalal, A.: Automatic human posture estimation for
sports activity recognition with robust body parts detection. Mul-
timedia Tools Appl. 80, 21465–21498 (2021)
17. Xu, T.; Wang, W.: Crowd counting using accumulated HOG. In:
Proceedings of the ICNC-FSKD, pp. 1877–1881. IEEE (2016)
18. Jalal, A.; Mahmood, M.: Multi-features descriptors for human
activity tracking and recognition in Indoor-outdoor environments.
In: Proceedings of the IBCAST, pp. 371–376. IEEE (2019)
19. Wan, J.; Chan, A.: Adaptive density map generation for crowd
counting. In: Proceedings of the CVF, pp. 1130–1139. IEEE (2019)
20. Pervaiz, M.; Jalal, A.: Hybrid algorithm for multi people counting
and tracking for smart surveillance. In: Proceedings of the IBCAST,
pp. 530–535 (2021)
21. Zhang, X.; Zhang, L.: Real-time crowd counting with human detec-
tion and human tracking. In: Proceedings of the on NIP, pp. 1–8.
Springer. Cham (2014)
22. Khan, S.D.; Cheikh, F.A.: Disam: Density independent and scale
aware model for crowd counting and localization. In: Proceedings
of the ICIP, pp. 4474–4478. IEEE (2019)
23. Gochoo, M.; Jalal, A.: A systematic deep learning-based overhead
tracking and counting system using RGB-D remote cameras. Appl.
Sci. 11, 5503 (2021)
123
Arabian Journal for Science and Engineering
24. Pervaiz, M.; Jalal, A.: Smart surveillance system for people
counting and tracking using particle flow and modified SOM. Sus-
tainability 13, 5367 (2021)
25. Merad, D.; Drap, P.: Tracking multiple persons under partial and
global occlusions: application to customers’ behavior analysis. Pat-
tern Recogn. Lett. 81, 11–20 (2016)
26. Chahyati, D.; Arymurthy, A.M.: Multiple human tracking using
Retinanet features, Siamese neural network, and Hungarian algo-
rithm. In: Proceedings of the IAEME, vol. 10, pp. 465-475 (2020)
27. Pradeepa, B.; Vaidehi, V.: Anomaly detection in crowd scenes using
streak flow analysis. In: Proceedings of the WiSPNET, pp. 363–368
(2019)
28. Kim, K.; Jalal, A.: Vision-based human activity recognition system
using depth silhouettes. J. Electr. Eng. Technol. 14, 2567–2573
(2019)
29. Zhang, X.; Stevens, B.: Scene perception guided crowd anomaly
detection. Neurocomputing 414, 291–302 (2020)
30. Khan, M.U.K.; Kyung, C.M.: Rejecting motion outliers for efficient
crowd anomaly detection. IEEE Trans. IFS 14, 541–556 (2018)
31. Shehzad, A.; Jalal, A.; Kim, K.: Multi-person tracking in smart
surveillance system for crowd counting and normal/abnormal
events detection. In: Proceedings of the ICAEM, pp. 163–168.
IEEE (2019)
32. Yimin, D.O.U.; Wei, C.: Abnormal behavior detection based on
optical flow trajectory of human joint points. In: Proceedings of
the CCDC, pp. 653–658. IEEE (2019)
33. Nawaratne, R.; Yu, X.: Spatiotemporal anomaly detection using
deep learning for real-time video surveillance. IEEE Trans. Ind.
Inf. 16, 393–402 (2019)
34. Chen, T.; Chen, H.: Anomaly detection in crowded scenes using
motion energy model. Multimedia Tools Appl. 77, 14137–14152
(2018)
35. Khalid, N.; Jalal, A.; Kim, K.: Modeling two-person segmentation
and locomotion for stereoscopic action identification. Sustainabil-
ity 13, 970 (2021)
36. Minaee, S.; Terzopoulos, D.: Image segmentation using deep learn-
ing: a survey. In: Proceedings of the TPAMI. IEEE (2021)
37. Jalal, A.; Ahmed, A.; Kim, K.: Scene semantic recognition based
on modified fuzzy C-mean and maximum entropy using object-to-
object relations. IEEE Access 9, 27758–27772 (2021)
38. Rafique, A.A.; Jalal, A.: Statistical multi-objects segmentation for
indoor/outdoor scene detection and classification via depth images.
In: Proceedings of the IEEE, pp. 271–276. IBCAST (2020)
39. Sahbani, B.; Adiprawita, W.: Kalman filter and iterative-hungarian
algorithm implementation for low complexity point tracking as
part of fast multiple object tracking system. In: Proceedings of the
ICSET, pp. 109–115. IEEE (2016)
40. Jalal, A.; Sarif, N.; Kim, T.S.: Human activity recognition via
recognized body parts of human depth silhouettes. Indoor Built
Environ. 22, 271–279 (2013)
41. Alzahrani, A.J.; Ullah, H.: Anomaly detection in crowds by fusion
of novel feature descriptors. J. Eng. Manag. Technol. 11, 11A16B-1
(2020)
42. Jalal, A.; Kim, Y.; Kim, D.: Ridge body parts features for human
pose estimation and recognition from RGB-D video data. In: Pro-
ceedings of the ICCCNT, pp. 1–6. IEEE (2014)
43. Zhu, H.D.; Zhong, Y.: Feature selection method by applying par-
allel collaborative evolutionary genetic algorithm. J. Electr. Sci.
Technol. 8, 108–113 (2010)
44. Jalal, A.; Lee, S.; Kim, T.S.: Human activity recognition via the
features of labeled depth body parts. In: Proceedings of the Smart
Homes and Health Telematics, pp. 246–249. Springer (2012)
45. Chan, A.B.; Vasconcelos, N.: Modeling, clustering, and segment-
ing video with mixtures of dynamic textures. In: Proceedings of
the IEEE TPAMI, vol. 30, pp. 909–926 (2008)
46. Chen, K.; Xiang, T.: Feature mining for localised crowd counting.
In Bmvc. 1, 3 (2012)
47. Mehran, R.; Shah, M.: Abnormal crowd behavior detection using
social force model. In: Proceedings of the CVPR, pp. 935–942
(2009)
48. Rabiee, H.; Murino, V.: Novel dataset for fine-grained abnormal
behavior understanding in-crowd. In: Proceedings of the AVSS,
pp. 95–101. IEEE (2016)
49. Chriki, A.; Kamoun, F.: Deep learning and handcrafted features
for one-class anomaly detection in UAV video. Multimedia Tools
Appl. 80, 2599–2620 (2021)
50. Abdulhussain, S.H.; Mahmmod, B.M.; Saripan, M.I.; Al-Haddad,
S.A.R.; Baker, Flayyih, W.N.; Jassim, W.A.: A fast feature extrac-
tion algorithm for image and video processing. In: Proceedings of
the IJCNN, pp. 1–8 (2019)
51. Ma, T.; Li, N.: Scene invariant crowd counting using multi-
scales head detection in video surveillance. IET Image Proc. 12,
2258–2263 (2018)
52. Miao, Y.; Zhang, B.: ST-CNN: spatial–temporal convolutional neu-
ral network for crowd counting in videos. Pattern Recogn. Lett. 125,
113–118 (2019)
53. Xu, M.; Xu, C.: Depth information guided crowd counting for com-
plex crowd scenes. Pattern Recogn. Lett. 125, 563–569 (2019)
54. Saqib, M.; Blumenstein, M.: Crowd counting in low-resolution
crowded scenes using region-based deep convolutional neural net-
works. IEEE Access 7, 35317–35329 (2019)
55. Pandey, A.; Trivedi, A.: KUMBH MELA: a case study for
dense crowd counting and modeling. Multimedia Tools Appl. 79,
17837–17858 (2020)
56. Reddy,M.K.K.; Wang, Y.: Few-shot scene adaptive crowdcounting
using meta-learning. In: Proceedings of the CVF, pp. 2814–2823.
IEEE (2020)
57. He, Y.; Gong, Y.: Error-aware density isomorphism reconstruction
for unsupervised cross-domain crowd counting. In: Proceedings of
the AAAI (2021)
58. Neiswanger, W.; Xing, E.: The dependent Dirichlet process mix-
ture of objects for detection-free tracking and object modeling. In:
Proceedings of the Artificial Intelligence Statistics, pp. 660–668
(2014)
59. Kim, C.; Rehg, J.M.: Multiple hypothesis tracking revisited. In:
Proceedings of the CV, pp. 4696–4704. IEEE (2015)
60. Milan, A.; Roth, S.: Multi-target tracking by discrete-continuous
energy minimization. IEEE TPAMI 38, 2054–2068 (2016)
61. Ren, W.; Chan, A.B.: Tracking-by-counting: using network flows
on crowd density maps for tracking multiple targets. IEEE Trans.
Image Proc. 30, 1439–1452 (2020)
62. Abdullah, F.; Gochoo, M.; Jalal, A.: Multi-person tracking and
crowd behavior detection via particles gradient motion descriptor
and improved entropy classifier. Entropy 23, 628 (2021)
63. Leyva, R.; Li, C.T.: Video anomaly detection with compact fea-
ture sets for online performance. IEEE Trans. Image Process. 26,
3463–3478 (2017)
64. Sezer, E.S.; Can, A.B.: Anomaly detection in crowded scenes using
log-Euclidean covariance matrix. In: Proceedings of the VISI-
GRAPP, pp. 279–286 (2018)
65. Patil, N.; Biswas, P.K.: Global abnormal events detection in
crowded scenes using context location and motion-rich spatio-
temporal volumes. IET Image Proc. 12, 596–604 (2018)
123
Arabian Journal for Science and Engineering
66. Ege, C.Ö.: Two-Stage Sparse Representation Based Abnormal
Crowd Event Detection in Videos. University of Helsinki (2020)
67. Moustafa, A.N.; Gomaa, W.: Gate and common pathway detec-
tion in crowd scenes and anomaly detection using motion
units and LSTM predictive models. Multimedia Tools Appl. 79,
20689–20728 (2020)
68. Hassanein, A.S.; Yagi, Y.: Identifying motion pathways in highly
crowded scenes: a non-parametric tracklet clustering approach.
Comput. Vis. Image Underst. 191, 102710 (2020)
69. Rehman, A.U.; Mahmood, T.; Khan, H.O.A.: Multi-modal
anomaly detection by using audio and visual cues. IEEE Access 9,
30587–30603 (2021)
Springer Nature or its licensor holds exclusive rights to this article
under a publishing agreement with the author(s) or other rightsholder(s);
author self-archiving of the accepted manuscript version of this article
is solely governed by the terms of such publishing agreement and appli-
cable law.