A Novel Deep Learning Model for Understanding
Two-Person Interactions Using Depth Sensors
Manahil Waheed, Dept. of Creative Technologies, Air University, Islamabad, Pakistan (manahilwaheedar@gmail.com)
Madiha Javeed, Dept. of Computer Science, Air University, Islamabad, Pakistan (191880@students.au.edu.pk)
Ahmad Jalal, Dept. of Computer Science, Air University, Islamabad, Pakistan (ahmadjalal@mail.au.edu.pk)
Abstract: Despite the ever-increasing efforts made in the field of
data science and artificial intelligence, the task of automatic human
interaction recognition remains challenging. Advanced computer
vision sensors like depth sensors have made it easier to achieve the
goal of accurate recognition of human interactions in complex
situations. The reason for their success is that they are robust against
lighting and illumination variation and are insensitive to color and
texture changes. Therefore, the proposed system combines both RGB
and depth images to train a Convolutional Neural Network (CNN).
The robust features extracted from CNN have been classified using
a Softmax classifier. Two publicly available large RGB-D datasets
have been used for training and evaluating the performance of the
proposed method. The proposed method has achieved an accuracy of
87.03% with the NTU RGB+D dataset and 86.21% with the UoL 3D
Social Interaction dataset.
Keywords: convolutional neural network, deep learning, depth
videos, human interaction recognition, softmax classifier.
I. INTRODUCTION
Human interaction recognition (HIR) refers to the task of
understanding a mutual activity performed by two human beings.
This field has attracted many researchers owing to its wide range
of applications, including security [1-4], smart homes [5-10],
content-based video retrieval [11-14], healthcare [15-20],
surveillance [21-24], and human tracking [25-29]. However, it is
a complex task because of multiple reasons, such as change of
viewpoint, occlusion, variation in clothing and lighting
conditions, low-resolution images, and unavailability of large
datasets. Some progress has been made ever since the introduction
of low-cost depth sensors, such as Microsoft Kinect [30-32],
because they are not as affected by lighting conditions as RGB
cameras.
This research proposes a fusion of RGB and depth images to
train a CNN model. The UoL (University of Lincoln) 3-D Social
Interaction dataset provides RGB and depth images. The NTU
RGB+D (Nanyang Technological University's Red Blue Green
and Depth) dataset comprises RGB videos and the corresponding
depth maps. Hence, the RGB videos have been converted into
image frames. To reduce the computational complexity, only 10
keyframes have been selected from each video. The keyframes
have been extracted by comparing the histograms of consecutive
frames. The differences between the histograms of every two
consecutive image frames have been stored in an array and the
frames with the highest differences have been selected as
keyframes afterward. Once keyframes have been extracted from
RGB videos, they have been combined with the corresponding
depth frames. Next, the 4-dimensional images have been fed to a
CNN model that uses VGG-16 (Visual Geometry Group-16 layers
deep) [33] as the base model. Finally, a Softmax classifier has
been proposed for classification.
Similar research work is described in Section II and the
proposed methodology is discussed in Section III. Section IV
presents the implementation details and results of the proposed
method. The conclusion of the research is given in Section V.
II. RELATED WORK
Recent years have seen a lot of progress in the field of human
activity recognition [34-38]. However, identifying interactions
between two human beings is a more challenging task [39]. For
this purpose, many researchers have preferred RGBD data over
RGB data [40-45]. With the availability of this additional depth
information, depth gradients can also be used to extract local
features [46-49]. Moreover, both sensor-based [50-56] and vision-
based [57-60] HIR systems have been developed in the past.
The first step in recognizing human interactions in videos is to
represent events and scenes as image features [61-65]. Based on
those features, an interaction class is assigned to the input video
[66-70]. Another important step during feature extraction is the
identification of key body parts [71-74] and pose estimation [75-
77]. A common approach is to extract hierarchical features
[78,79] from human bodies. Some researchers have also chosen
hybrid features for better classification results [80-83]. For
example, researchers in [84] used a combination of different
blobs, multiple orientations, Fourier transforms, and
geometrical points over the objects as features. A. Jalal et al. [85]
extracted various features, including energy, sine, distinct body
parts movements, and a 3D Cartesian view of smoothing gradients
features. Similarly, a hybrid of four different local descriptors was
used by the authors of [86], i.e., spatio-temporal features, energy-
based features, shape-based angular and geometric features, and a
motion-orthogonal histogram of oriented gradient (MO-HOG).
CNNs have been used extensively for classification purposes [87-90] and have also proved effective as feature extractors. The authors of [91,92] used a CNN as the encoder in their image captioning systems, with Inception V3 as the base model.
III. METHODOLOGY
This section discusses the proposed methodology for HIR. The
system takes both RGB and depth videos as input. The videos are
first converted into images at the rate of 31 frames per second and
then 10 keyframes are extracted from each video. Pre-processing
is done over the extracted keyframes to enhance the image quality,
making it easier to extract the desired features. These pre-
processed images are then used for human detection and
segmentation. The segmented RGB and depth images are
concatenated and then fed to a CNN model, which extracts
important features from them. These features are then given to the
Softmax classifier that generates the class labels. Fig. 1 shows an
overview of this method.
Fig. 1. A general overview of the proposed architecture.
A. Preprocessing
The NTU RGB+D dataset provides RGB videos and depth
frames. Hence, the RGB videos have been converted into frames
to get the same number of frames against each RGB video as the
depth maps. Since the dataset has 48 videos per class and this
research uses 11 classes, it is computationally very expensive to
keep all the extracted frames. Therefore, only 10 keyframes have
been extracted from each video. The keyframes have been
extracted by computing the differences between the histograms of
every two consecutive frames. The top ten frames corresponding
to the highest differences have been selected. All RGB and depth
images have been cropped to obtain the desired regions. Then they
have been pre-processed using multiple techniques discussed in
detail below. Applying such preprocessing techniques helps
improve the overall accuracy of the system.
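As an illustration, the following Python sketch shows one way the histogram-difference keyframe selection described above could be implemented with OpenCV. The function name, the use of a grayscale histogram, and the L1 distance between histograms are assumptions for the sketch; the paper does not specify these details.

```python
import cv2
import numpy as np

def select_keyframes(video_path, num_keyframes=10):
    """Pick the frames whose histograms differ most from the previous frame."""
    cap = cv2.VideoCapture(video_path)
    frames, hist_diffs = [], []
    prev_hist = None
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frames.append(frame)
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [256], [0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            # Difference between consecutive histograms (L1 distance assumed here)
            hist_diffs.append(np.sum(np.abs(hist - prev_hist)))
        else:
            hist_diffs.append(0.0)
        prev_hist = hist
    cap.release()
    # Indices of the frames with the largest histogram differences
    top_idx = np.argsort(hist_diffs)[-num_keyframes:]
    return [frames[i] for i in sorted(top_idx)]
```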
1) Histogram Equalization:
To improve the quality of the images, the contrast is enhanced
by adjusting the intensity values. This is done by the histogram
equalization technique. The normalized histogram $p(r_k)$ of an image is given by eq. (1):

$$p(r_k) = \frac{n_k}{n}, \qquad k = 0, 1, \dots, L-1 \qquad (1)$$

where $n_k$ is the number of pixels with intensity $r_k$, $k$ ranges from $0$ to $L-1$ ($L$ is 256), and $n$ is the total number of pixels in the image. The histogram-equalized image is defined by eq. (2):

$$s_k = T(r_k) = (L-1)\sum_{j=0}^{k} p(r_j) \qquad (2)$$
The results of histogram equalization are shown in Fig. 2.
Fig. 2. Histogram Equalization. (a) original image, (b) histogram of original image,
(c) image after histogram equalization, and (d) histogram of the equalized image.
2) Image Smoothing
After histogram equalization, all images have been de-noised
using mean filtering. In this method, each pixel value in the image
has been replaced by the mean of its neighboring pixels, as shown
in eq. (3).

$$g(i,j) = \frac{1}{M}\sum_{(x,y)\in N(i,j)} f(x,y) \qquad (3)$$

where $(i,j)$ denotes a pixel location, $N(i,j)$ is the neighborhood window centered at that pixel, $f$ and $g$ are the input and smoothed images, and $M$ is the window size, i.e., the number of neighboring pixels.
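A minimal OpenCV sketch of the two preprocessing steps described above (histogram equalization followed by mean filtering) might look as follows; the 3x3 window size is an assumed value, since the paper does not report the exact neighborhood used.

```python
import cv2

def preprocess(gray_image, window=3):
    """Enhance contrast, then smooth with a mean (averaging) filter."""
    equalized = cv2.equalizeHist(gray_image)          # eq. (1)-(2)
    smoothed = cv2.blur(equalized, (window, window))  # eq. (3): mean over the window
    return smoothed
```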
B. Image Segmentation
Image segmentation reduces the complexity of the image as it
returns only the desired part of the image. Moreover, it makes
sense to remove the redundant background from all the images so
the features of the background, which will be the same for
different classes, do not play a part while determining the
interaction class. In order to segment human beings from RGB
and depth images, two image segmentation techniques have been
used, as discussed in detail below.
1) RGB Image Segmentation
Humans have been segmented from the RGB images using the
edge detection technique. Edges or boundaries are detected based
on discontinuity in the intensity values of the pixels. For this
purpose, all RGB images are first converted into grayscale images
and a binary silhouette is extracted using the detected edges. A
floor detection and removal technique has also been implemented
for the NTU RGB+D dataset where the floor often gets
misclassified as the foreground. Based on the range of its intensity
values, a floor mask has been created, which is then used to remove
the floor. The original RGB pixel values are then restored in the
detected binary silhouettes to get the desired RGB silhouettes. Fig.
3 shows the results of the RGB image segmentation stage.
Fig. 3. RGB silhouette segmentation: (a) 'hugging' interaction (NTU RGB+D): original image (left), binary silhouette (center), and segmented image (right); (b) 'shaking hands' interaction (UoL 3D): original image (left), binary silhouette (center), and segmented image (right).
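The edge-based RGB segmentation could be sketched as below. The Canny thresholds, the morphological closing used to turn edges into a filled silhouette, and the floor intensity range are illustrative assumptions rather than values reported in the paper.

```python
import cv2
import numpy as np

def segment_rgb(rgb_image, floor_range=(90, 140)):
    """Edge-based human silhouette extraction with an intensity-based floor mask."""
    gray = cv2.cvtColor(rgb_image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)                 # detect intensity discontinuities
    kernel = np.ones((5, 5), np.uint8)
    closed = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)
    # Fill the detected contours to obtain a binary silhouette
    contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    silhouette = np.zeros_like(gray)
    cv2.drawContours(silhouette, contours, -1, 255, thickness=cv2.FILLED)
    # Remove floor pixels that fall inside an assumed intensity range
    floor_mask = cv2.inRange(gray, floor_range[0], floor_range[1])
    silhouette[floor_mask > 0] = 0
    # Restore the original RGB values inside the silhouette
    return cv2.bitwise_and(rgb_image, rgb_image, mask=silhouette)
```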
2) Depth Image Segmentation
For depth images, Otsu’s thresholding technique has been used
[93]. The intensity values of the depth images available in the NTU
RGB+D dataset have been adjusted as they were too dark to be
seen. Then the cropped and intensity-adjusted images have been
segmented using Otsu’s thresholding technique, which selects the threshold $t$ that minimizes the weighted intra-class variance given in eq. (4). Fig. 4 shows the results of the depth image segmentation stage.

$$\sigma_w^2(t) = \omega_0(t)\,\sigma_0^2(t) + \omega_1(t)\,\sigma_1^2(t) \qquad (4)$$

where $\sigma_w^2(t)$ is the weighted sum of the intra-class variances of the two classes (foreground and background), $\omega_0(t)$ and $\omega_1(t)$ are the probabilities of the two classes separated by the threshold, and $t$ is the threshold value.
Fig. 4. Depth silhouette segmentation: (a) 'kicking' interaction (NTU RGB+D): original intensity-adjusted image (left), binary silhouette (center), and segmented image (right); (b) 'help stand up' interaction (UoL 3D): original image (left), binary silhouette (center), and segmented image (right).
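A possible implementation of this depth segmentation step uses OpenCV's built-in Otsu thresholding. The gain and offset used for the intensity adjustment, and the assumption that the adjusted depth frames are 8-bit, are illustrative choices not stated in the paper.

```python
import cv2

def segment_depth(depth_image, alpha=1.5, beta=40):
    """Brighten a dark depth frame, then segment it with Otsu's threshold (eq. 4)."""
    # Intensity adjustment (the gain/offset values are illustrative assumptions)
    adjusted = cv2.convertScaleAbs(depth_image, alpha=alpha, beta=beta)
    # Otsu picks the threshold t that minimizes the weighted intra-class variance
    _, silhouette = cv2.threshold(adjusted, 0, 255,
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return cv2.bitwise_and(adjusted, adjusted, mask=silhouette)
```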
C. Feature Extraction via CNN
For extraction of features from images, a Convolutional Neural
Network (CNN) has been used. The transfer learning approach
has been employed, which includes using VGG16 as the base
model and then fine-tuning its weights according to the used
datasets. VGG16 is a CNN model that achieves 92.7% top-5 accuracy on the ImageNet dataset, which has 1000 classes. Fig. 5 shows all the
layers in the VGG16 model.
Fig. 5. Different layers of the VGG16 architecture with configurations.
First, the training and testing images have been passed through the VGG16 base model, producing feature maps with dimensions of 7x7x512. These feature maps have then been fed to the proposed CNN head, which has three convolutional layers with 128, 64, and 32 filters, respectively. The convolutional layers compute the outputs of neurons that are connected to local regions in the input. Convolution is equivalent to sliding a filter over an image and computing the dot product of the filter weights and the image pixels. The Rectified Linear Unit (ReLU) has been used as the activation function for all three convolutional layers; it simply sets all negative values to zero. Then a batch normalization layer followed by a flatten layer has been used. Lastly, a dropout layer with a rate of 0.2 has been used to avoid overfitting.
Fig. 6 shows the layers in the proposed model. Table I shows a
summary of the proposed CNN model.
Fig. 6. A general overview of the layers in our CNN model.
TABLE I. A BRIEF SUMMARY OF OUR CNN MODEL

Layer      | Output Shape       | Parameters
Conv:128   | (None, 7, 7, 128)  | 65664
Conv:64    | (None, 7, 7, 64)   | 8256
Conv:32    | (None, 7, 7, 32)   | 2080
BatchNorm  | (None, 7, 7, 32)   | 128
Flatten    | (None, 1568)       | 0
Dropout    | (None, 1568)       | 0
Softmax    | (None, 11)         | 17259
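Based on the layer summary in Table I, a Keras sketch of this feature-extraction pipeline might look as follows. The 1x1 kernel size is inferred from the reported parameter counts (e.g., 512x128 + 128 = 65,664), freezing the VGG16 base is a simplification of the fine-tuning described above, and the handling of the fused RGB-D input (VGG16 itself expects three channels) is an assumption not specified in the paper.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# VGG16 used as the base feature extractor (outputs 7x7x512 for a 224x224 input)
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # assumption: base kept frozen in this sketch

# Proposed head: three 1x1 convolutions (128, 64, 32 filters), batch normalization,
# flatten, dropout, and an 11-way softmax for the NTU RGB+D interaction classes.
model = models.Sequential([
    base,
    layers.Conv2D(128, (1, 1), activation="relu"),
    layers.Conv2D(64, (1, 1), activation="relu"),
    layers.Conv2D(32, (1, 1), activation="relu"),
    layers.BatchNormalization(),
    layers.Flatten(),
    layers.Dropout(0.2),
    layers.Dense(11, activation="softmax"),
])
model.summary()  # parameter counts match Table I (65664, 8256, 2080, 128, 17259)
```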
D. Human Interaction Recognition Using Softmax
After extracting features using CNN, the softmax classifier has
been used to recognize human interactions. The softmax function
is a popular choice for multiclass classification [94]. It converts the raw class scores into probabilities that sum to 1. The softmax output for each class is computed using eq. (5).

$$P_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}} \qquad (5)$$

where $P_i$ is the probability assigned to class $i$, $z_i$ is the score of class $i$, and $n$ is the total number of classes.
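For illustration, eq. (5) can be evaluated directly on a vector of class scores; the snippet below is a generic NumPy sketch, not code from the paper.

```python
import numpy as np

def softmax(scores):
    """Eq. (5): convert raw class scores into probabilities that sum to 1."""
    exp_scores = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return exp_scores / np.sum(exp_scores)

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())  # the probabilities sum to 1
```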
IV. EXPERIMENTAL SETUP AND RESULTS
This section gives a brief description of the datasets used for
experimentation, the implementation details, and the results of
different experiments conducted to evaluate the performance of the
proposed HIR model. The results also contain a comparison of the
proposed system’s accuracy with that of other state-of-the-art
systems.
A. Datasets
1) NTU RGB+D dataset
The NTU RGB+D dataset [95,96] consists of 60 classes, 11 of
which are two-person interactions: punching, kicking, pushing, pat
on back, point finger, hugging, giving object, touch pocket,
shaking hands, walking towards, and walking apart. There are 48
videos for each interaction class. Each session has three sets of
videos since each video is recorded from three different
viewpoints. Fig. 7 shows a few sample frames from this dataset.
Fig. 7. NTU RGB+D dataset. (a) RGB frame for 'giving object' (left), RGB frame for 'punching' (center), and RGB frame for 'pat on back' (right); (b) depth frame for 'giving object' (left), depth frame for 'punching' (center), and depth frame for 'pat on back' (right).
2) UoL 3D Social Interaction dataset
The UoL 3D social interaction dataset [97] provides RGB+D
videos and skeleton information of 8 interaction classes: shaking
hands, talk, help walk, help stand up, hug, push, fight, and draw
attention. This dataset includes ten sessions, each comprising two
long videos containing all eight interactions. The skeleton tracks
are provided in a text format. Information about 25 skeleton joints
is provided. Fig. 8 shows a few sample frames from this dataset.
Fig. 8. UoL 3D Social Interaction dataset. (a) RGB frame for 'hugging' (left), RGB frame for 'shaking hands' (center), and RGB frame for 'kicking' (right); (b) depth frame for 'hugging' (left), depth frame for 'shaking hands' (center), and depth frame for 'kicking' (right).
B. Implementation
The proposed CNN model has been developed in Python using
Jupyter Notebook. Python’s deep learning library, Keras, has been
used as it provides the VGG-16 model and different layers for
convolution, batch normalization, flattening, dropout, and
softmax. The proposed model has been trained for 30 epochs. Fig.
9 shows how the model’s accuracy increased and loss decreased
with the increase in the number of epochs.
Fig. 9. Accuracy and loss graphs: (a) model accuracy increased with increasing epochs; (b) model loss decreased with increasing epochs.
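Under the reported settings (Keras, 30 epochs, one-third of the data held out), the training step could be sketched as follows, reusing the model object from the Section III.C sketch. The optimizer, batch size, random seed, and placeholder data are assumptions not stated in the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the assumed input shape; in practice X holds the fused,
# pre-processed keyframe images and y the one-hot interaction labels.
X = np.random.rand(120, 224, 224, 3).astype("float32")
y = np.eye(11)[np.random.randint(0, 11, size=120)]

# One-third of the data held out for validation/testing, as reported in the paper
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=1/3, random_state=42)

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=30)
```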
C. Results
To evaluate the performance of the proposed system, a one-third validation split has been used, i.e., one-third of the data has been held out for testing while the rest has been used for training. Tables II and III show
the average accuracies achieved by the proposed model and the
accuracies achieved per interaction class over the NTU RGB+D
and UoL 3D Social Interaction datasets respectively.
TABLE II. RECOGNITION ACCURACIES OF CLASSES OF NTU RGB+D DATASET

Class            | Accuracy (%)
punching         | 82.23
kicking          | 94.72
pushing          | 82.02
pat on back      | 79.94
point finger     | 95.15
hugging          | 95.15
giving object    | 80.76
touch pocket     | 83.05
shaking hands    | 92.51
walking towards  | 89.24
walking apart    | 82.56
average accuracy | 87.03
TABLE III. RECOGNITION ACCURACIES OF CLASSES OF UOL 3D DATASET

Class            | Accuracy (%)
handshake        | 81.12
talk             | 88.45
help walk        | 85.32
help stand up    | 83.21
hug              | 90.02
push             | 87.30
fight            | 86.04
draw attention   | 88.21
average accuracy | 86.21
Tables IV and V show a comparison of the results of the
proposed system with two state-of-the-art (SOTA) classifiers:
Bayesian and Random Forest [98]. The proposed method also
outperforms some recent state-of-the-art methods as shown in
Tables VI and VII.
TABLE IV. COMPARISON OF THE PROPOSED METHOD WITH SOTA CLASSIFIERS OVER NTU RGB+D DATASET

Interaction Class | Random Forest (%) | Bayesian (%) | Proposed method (%)
punching          | 47.43 | 96.41 | 82.23
kicking           | 91.56 | 72.05 | 94.72
pushing           | 86.74 | 92.56 | 82.02
pat on back       | 79.48 | 84.36 | 79.94
point finger      | 91.56 | 79.49 | 95.15
hugging           | 87.74 | 74.62 | 95.15
giving object     | 73.07 | 71.54 | 80.76
touch pocket      | 72.35 | 76.67 | 83.05
shaking hands     | 87.17 | 83.07 | 92.51
walking towards   | 88.74 | 70.26 | 89.24
walking apart     | 50.24 | 70.26 | 82.56
average accuracy  | 77.82 | 79.21 | 87.03
TABLE V. COMPARISON OF THE PROPOSED METHOD WITH SOTA CLASSIFIERS OVER UOL 3D DATASET

Interaction Class | Random Forest (%) | Bayesian (%) | Proposed method (%)
handshake         | 52.43 | 84.41 | 81.12
talk              | 85.71 | 82.05 | 88.45
help walk         | 82.45 | 72.56 | 85.32
help stand up     | 79.48 | 74.36 | 83.21
hug               | 85.71 | 79.49 | 90.02
push              | 78.34 | 74.62 | 87.30
fight             | 77.07 | 81.54 | 86.04
draw attention    | 74.35 | 76.67 | 88.21
average accuracy  | 76.94 | 78.21 | 86.21
TABLE VI. COMPARISON OF THE PROPOSED METHOD WITH SOTA METHODS OVER NTU RGB+D DATASET

Authors               | Methods                         | Accuracy (%)
Songyang et al. [99]  | geometric features              | 70.26
Inwoong et al. [100]  | ensemble TS-LSTM v2             | 74.60
Amir et al. [101]     | deep multimodal features        | 74.9
Junwoo et al. [102]   | mobile robot platform           | 75.0
Mengyuan et al. [103] | enhanced skeleton visualization | 75.97
proposed model        |                                 | 87.03
TABLE VII. COMPARISON OF THE PROPOSED METHOD WITH SOTA METHODS OVER UOL 3D DATASET

Authors               | Methods                                    | Accuracy (%)
Claudio et al. [104]  | probabilistic merging of skeletal features | 85.1
Muhammad et al. [105] | multimodal feature level fusion            | 85.12
Claudio et al. [97]   | statistical and geometrical features       | 85.56
proposed model        |                                            | 86.21
V. CONCLUSION
In this paper, an HIR system has been proposed that efficiently
recognizes complex human-to-human interactions using both
RGB and depth information. The performed experiments have
shown that RGB-D images give better results than RGB images.
Furthermore, using only 10 keyframes instead of the entire videos reduces the model training time.
As future work, the researchers plan to explore new and better
ways of fusing RGB and depth images for a more efficient
classification system. It is also intended to train and evaluate the
proposed system on more challenging datasets.
REFERENCES
[1] O. Aran and D. Gatica-Perez, “One of a kind: Inferring personality
impressions in meetings,” in Proc. on ICMI (ACM), pp. 11-18, 2013.
[2] A. Jalal, S. Kamal, and D. Kim, “Depth map-based human activity tracking
and recognition using body joints features and self-organized map,” in Proc.
on CCNT, pp. 1-6, 2014.
[3] A. Jalal and Y. Kim, Dense depth maps-based human pose tracking and
recognition in dynamic scenes using ridge data,” in Proc. on Advanced
Video and Signal-based Surveillance, pp. 119-124, 2014.
[4] A. Jalal, S. Kamal, and D. Kim, “Shape and motion features approach for
activity tracking and recognition from Kinect video camera,” in Proc. on
Advanced Information Networking and Applications Workshops, pp. 445-
450, 2015.
[5] A. Jalal, N. Sharif, J.T. Kim, and T.S. Kim, “Human activity recognition via
recognized body parts of human depth silhouettes for residents monitoring
services at smart homes, in Indoor and Built Environment, vol. 22, pp. 271-
279, 2013.
[6] A. Jalal, M.A.K. Quaid, and M.A. Sidduqi, “A triaxial acceleration-based
human motion detection for ambient smart home system,” in Proc. on
Applied Sciences and Technology, 2019.
[7] A. Jalal, S. Lee, J. Kim, and T. Kim, “Human activity recognition via the
features of labeled depth body parts,” in Proc. on Smart Homes Health
Telematics, pp. 246-249, 2012.
[8] A. Jalal, J.T. Kim, and T.S Kim, “Development of a life logging system via
depth imaging-based human activity recognition for smart homes,” in Proc.
on Sustainable Healthy Buildings, pp. 91-95, 2012.
[9] A. Jalal, S. Kamal, and D. Kim, “A depth video sensor-based life-logging
human activity recognition system for elderly care in smart indoor
environments,” in Sensors, vol. 14, pp. 11735-11759, 2014.
[10] T. Kim, A. Jalal, H. Han, H. Jeon, and J. Kim, “Real-time life logging via
depth imaging-based human activity recognition towards smart homes
services,” in Proc. on Renewable Energy Sources and Healthy Buildings, pp.
63, 2013.
[11] G.H. Liu, J.Y. Yang, and Z. Li, “Content-based image retrieval using
computational visual attention model,” in Pattern Recognition, vol. 48, pp.
2554-2566, 2015.
[12] S. Sempena, N.U. Maulidevi, and P.R. Aryan, “Human action recognition
using dynamic time warping,” in Proc. on ICEEI, pp. 1-5, 2011.
[13] A. Jalal, S. Kamal, and D. Kim, “Facial expression recognition using 1D
transform features and hidden markov model,” in Journal of Electrical
Engineering & Technology, vol. 12, pp. 1657-1662, 2017.
[14] M. Mahmood, A. Jalal, and H. A. Evans, “Facial expression recognition in
image sequences using 1D transform and gabor wavelet transform,” in Proc.
on Applied and Engineering Mathematics, 2018.
[15] A. Jalal, M. Batool, and K. Kim, “Stochastic recognition of physical activity
and healthcare using tri-axial inertial wearable sensors,” in Applied
Sciences, 2020.
[16] A. Jalal, M.A.K. Quaid, S.B. Tahir, and K. Kim, “A study of accelerometer
and gyroscope measurements in physical life-log activities detection
systems,” in Sensors, 2020.
[17] A. Jalal, M. Batool, and K. Kim, “Sustainable wearable system: human
behavior modeling for life-logging activities using k-ary tree hashing
classifier,” in Sustainability, 2020.
[18] M. Javeed, A. Jalal, and K. Kim, “Wearable sensors based exertion
recognition using statistical features and random forest for physical
healthcare monitoring,” in Proc. on Applied Sciences and Technology, 2021.
[19] A. Jalal, M. Batool and B. Tahir, “Markerless sensors for physical health
monitoring system using ECG and GMM feature extraction,” in Proc. on
IBCAST, 2021.
[20] A. Jalal, M.A.K. Quaid, and A.S. Hasan, “Wearable sensor-based human
behavior understanding and recognition in daily life for smart
environments, in Proc. on Frontiers of Information Technology, 2018.
[21] A. Shehzad, A. Jalal, and K. Kim, “Multi-person tracking in smart
surveillance system for crowd counting and normal/abnormal events
detection, in Proc. on Applied and Engineering Mathematics, 2019.
[22] P. Mahwish, G. Yazeed, M. Gochoo, A. Jalal, S. Kamal, and D. Kim, “A
smart surveillance system for people counting and tracking using particle
flow and modified SOM,” in Sustainability, 2021.
[23] P. Mahwish, A. Jalal, and K. Kim, “Hybrid algorithm for multi people
counting and tracking for smart surveillance,” in Proc. on IBCAST, 2021.
[24] N. Khalid, M. Gochoo, A. Jalal, and K. Kim, “Modeling two-person
segmentation and locomotion for stereoscopic action identification: a
sustainable video surveillance system, in Sustainability, 2021.
[25] A. Jalal, Y. Kim, S. Kamal, A. Farooq, and D. Kim, “Human daily activity
recognition with joints plus body features representation using Kinect
sensor,” in Proc. on Informatics, Electronics, and Vision, 2015.
[26] A. Jalal, S. Kamal, A. Farooq, and D. Kim, “A spatiotemporal motion
variation features extraction approach for human tracking and pose-based
action recognition,” in Proc. on Informatics, Electronics, and Vision, 2015.
[27] A. Nadeem, A. Jalal, and K. Kim, “Human actions tracking and recognition
based on body parts detection via artificial neural network,” in Proc. on
Advancements in Computational Sciences, 2020.
[28] S. Kamal, A. Jalal, and D. Kim, “Depth images-based human detection,
tracking and activity recognition using spatiotemporal features and Modified
HMM, in Journal of Electrical Engineering and Technology, pp. 1921-1926,
2016.
[29] A. Jalal, M. Mahmood, and A. S. Hasan, “Multi-features descriptors for
human activity tracking and recognition in Indoor-outdoor environments,”
in Proc. on Applied Sciences and Technology, 2019.
[30] M. Asadi-Aghbolaghi, et al., “A survey on deep learning based approaches
for action and gesture recognition in image sequences, in Proc. on
Automatic Face & Gesture Recognition, 2017.
[31] A. Jalal, S. Kamal, and D. Kim, “Human depth sensors-based activity
recognition using spatiotemporal features and hidden markov model for
smart environments, in Journal of Computer Networks and
Communications, vol. 2016, pp. 1-11, 2016.
[32] A. Jalal, S. Kamal, and D. Kim, “Depth Silhouettes Context: A new robust
feature for human tracking and activity recognition based on embedded
HMMs,” in Proc. on Ubiquitous Robots and Ambient Intelligence, pp. 294-
299, 2015.
[33] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition, in Proc. on Learning Representations, 2015.
[34] A. Jalal, J.T. Kim, and T.S. Kim, “Human activity recognition using the
labeled depth body parts information of depth silhouettes,” in Proc. on
Sustainable Healthy Buildings, pp. 1-8, 2012.
[35] A. Jalal, M.Z. Uddin, and T.S. Kim, “Depth video-based human activity
recognition system using translation and scaling invariant features for life
logging at smart home,” in IEEE Transaction on Consumer Electronics, vol.
58, pp. 863-871, 2012.
[36] A. Jalal, M.Z. Uddin, J.T. Kim, and T.S. Kim, “Daily human activity
recognition using depth Silhouettes and R transformation for smart home,”
in Proc. on Smart Homes Health Telematics, pp. 25-32, 2011.
[37] S. Badar, A. Jalal, and M. Batool, “Wearable Sensors for activity analysis
using SMO-based random forest over smart home and sports datasets”, in
Proc. on ICACS, 2020.
[38] S. Badar, A. Jalal, and K. Kim, “Wearable inertial sensors for daily activity
analysis based on Adam optimization and the maximum entropy markov
model”, in Entropy, vol. 22, pp. 1-19, 2020.
[39] A. Stergiou and R. Poppe, “Understanding human-human interactions: a
survey,” in Computer Vision and Image Understanding, 2019.
[40] A. Jalal, Y. Kim, and D. Kim, Ridge body parts features for human pose
estimation and recognition from RGB-D video data,” in Proc. on Computing,
Communication and Networking Technologies, pp. 1-6, 2014.
[41] M. Mahmood, A. Jalal, and K. Kim, “WHITE STAG model: Wise human
interaction tracking and estimation (WHITE) using spatio-temporal and
angular-geometric (STAG) Descriptors”, in Multimedia Tools and
Applications, 2020.
[42] A. Farooq, A. Jalal, and S. Kamal, “Dense RGB-D map-based human
tracking and activity recognition using skin joints features and self-
organizing map,” in KSII Transactions on internet and information systems,
vol. 9, pp. 1856-1869, 2015.
[43] A. Ahmed, A. Jalal, and K. Kim, “RGB-D images for object segmentation,
localization and recognition in indoor scenes using feature descriptor and
Hough voting”, in Proc. on Applied Sciences and Technology, 2020.
[44] M. Gochoo, S.R. Amna, G. Yazeed, A. Jalal, S. Kamal, and D. Kim, “A
systematic deep learning based overhead tracking and counting system using
RGB-D remote cameras,” in Applied Sciences, 2021.
[45] M.A.K. Quaid and A. Jalal, “Wearable sensors based human behavioral
pattern recognition using statistical features and reweighted genetic
Algorithm,” in Multimedia Tools and Applications, 2019.
[46] A. Ahmed, A. Jalal, and K. Kim, “A novel statistical method for scene
classification based on multi-object categorization and logistic regression,
in Sensors, 2020.
[47] S. Li, W. Zhang, and A.B. Chan, “Maximum-margin structured learning with
deep networks for 3D human pose estimation”, in Proc. on ICCV, pp. 2848-2856, 2015.
[48] A. Jalal, S. Kamal, and D. Kim, “Individual Detection-Tracking-Recognition
using depth activity images,” in Proc. on Ubiquitous Robots and Ambient
Intelligence, pp. 450-455, 2015.
[49] S. Kamal and A. Jalal, “A hybrid feature extraction approach for human
detection, tracking and activity recognition using depth sensors, in Arabian
Journal for Science and Engineering, vol. 41, pp. 1043-1051, 2016.
[50] M. Batool, A. Jalal, and K. Kim, “Sensors technologies for human activity
analysis based on SVM optimized by PSO algorithm,” in Proc. on ICAEM,
2019.
[51] B. Tahir, A. Jalal, and K. Kim, IMU sensor based automatic-features
descriptor for healthcare patient’s daily life-log recognition,” in Proc. on
Applied Sciences and Technology, 2021.
[52] M. Gochoo, S. Badar, A. Jalal, and K. Kim, “Monitoring real-time personal
locomotion behaviors over smart indoor-outdoor environments via body-
worn sensors,” in IEEE Access, 2021.
[53] U. Azmat and A. Jalal, “Smartphone inertial sensors for human locomotion
activity recognition based on template matching and codebook generation,”
in Proc. on Communication Technologies, 2021.
[54] A. Jalal, M.A.K. Quaid, and K. Kim, “A Wrist worn acceleration based
human motion analysis and classification for ambient smart home System,”
in Journal of Electrical Engineering & Technology, 2019.
[55] M. Batool, A. Jalal, and K. Kim, “Telemonitoring of daily activity using
accelerometer and gyroscope in smart home environments, in Journal of
Electrical Engineering and Technology, 2020.
[56] A. Jalal, M.A.K. Quaid, S.B. Tahir, and K. Kim, “A study of accelerometer
and gyroscope measurements in physical life-log activities detection
systems, in Sensors, 2020.
[57] A. Jalal, Y.H. Kim, Y.J. Kim, S. Kamal, and D. Kim, “Robust human activity
recognition from depth video using spatiotemporal multi-fused features, in
Pattern recognition, vol. 61, pp. 295-308, 2017.
[58] A. Jalal and S. Kamal, “Improved behavior monitoring and classification
using cues parameters extraction from camera array images,” in International
Journal of Interactive multimedia and Artificial Intelligence, vol. 5, 2018.
[59] K. Kim, A. Jalal, and M. Mahmood, “Vision-based human activity
recognition system using depth silhouettes: A smart home system for
monitoring the residents, in Journal of Electrical Engineering and
Technology, 2019.
[60] A. Jalal, S. Kamal, and D. Kim, “A depth video-based human detection and
activity recognition using multi-features and embedded hidden Markov
models for health care monitoring systems, in International Journal of
Interactive multimedia and Artificial Intelligence, vol. 4, pp. 54-62, 2017.
[61] A. Ahmed, A. Jalal, and K. Kim, “Multi‑objects detection and segmentation
for scene understanding based on Texton forest and kernel sliding
perceptron, in Journal of Electrical Engineering and Technology, 2020.
[62] I. Akhter, A. Jalal, and K. Kim, “Pose estimation and detection for event
recognition using Sense-Aware features and Adaboost classifier”, in Proc.
on IBCAST, 2021.
[63] A.A. Rafique, A. Jalal, and A. Ahmed, “Scene understanding and
recognition: statistical segmented model using geometrical features and
Gaussian naïve bayes, in Proc. on Applied and Engineering Mathematics,
2019.
[64] A. Ahmed, A. Jalal, and K. Kim, Region and decision tree-based
segmentations for multi-objects detection and classification in outdoor
scenes, in Proc. on Frontiers of Information Technology, 2019.
[65] A.A. Rafique, A. Jalal, and K. Kim, “Statistical multi-objects segmentation
for indoor/outdoor scene detection and classification via depth images, in
Proc. on Applied Sciences and Technology, 2020.
[66] A. Jalal, M. Mahmood, and M.A. Sidduqi, “Robust spatio-temporal features
for human interaction recognition via artificial neural network, in Proc. on
Frontiers of information technology, 2018.
[67] A. Jalal, S. Kamal, and C. Cecer, “Depth maps-based human segmentation
and action recognition using full-body plus body color cues via recognizer
engine, in Journal of Electrical Engineering & Technology, 2018.
[68] A. Jalal and M. Mahmood, “Students’ behavior mining in e-learning
environment using cognitive processes with information technologies,” in
Education and Information Technologies Springer, 2019.
[69] A. Jalal and S. Kim, “Algorithmic implementation and efficiency
maintenance of real-time environment using low-bitrate wireless
communication,” in Proc. on Software Technologies for Future Embedded
and Ubiquitous Systems, 2006.
[70] S. Abbasi, S. Kamal, M. Gochoo, A. Jalal, and D. Kim, “Affinity-based task
scheduling on heterogeneous multicore systems using CBS and QBICTM,”
in Applied Sciences, 2021.
[71] K. Nida, G. Y. Yazeed, M. Gochoo, A. Jalal, and K. Kim, “Semantic
recognition of human-object interactions via Gaussian-based elliptical
modelling and pixel-level labeling,” in IEEE Access, 2021.
[72] H. Ansar, A. Jalal, M. Gochoo, and K. Kim “Hand gesture recognition based
on auto‐landmark localization and reweighted genetic algorithm for
healthcare muscle activities”, in Sustainability, 2021.
[73] A. Jalal, A. Nadeem, and S. Bobasu, “Human body parts estimation and
detection for physical sports movements,” in Proc. on Communication,
Computing, and Digital Systems, 2019.
[74] S. Amna, A. Jalal, and K. Kim, “An Accurate Facial expression detector
using multi-landmarks selection and local transform features,” in Proc. on
IEEE conference, 2020.
[75] A. Jalal, S. Kamal, and D.S. Kim, Detecting complex 3D human motions
with body model low-rank representation for real-time smart activity
monitoring system,” KSII Transactions on Internet and Information
Systems, vol. 12, pp. 1189-1204, 2018.
[76] N. Amir, A. Jalal, and K. Kim, “Automatic human posture estimation for
sport activity recognition with robust body parts detection and entropy
markov model, in Multimedia Tools and Applications, 2021.
[77] A. Rafique, A. Jalal, and K. Kim, “Automated sustainable multi-object
segmentation and recognition via modified sampling consensus and kernel
sliding perceptron, in Symmetry, 2020.
[78] I. Akhter, A. Jalal, and K. Kim, “Adaptive pose estimation for gait event
detection using contextaware model and hierarchical optimization,” Journal
of Electrical Engineering and Technology, 2021.
[79] S.A. Rizwan, A. Jalal, M. Gochoo, and K. Kim, “Robust active shape model
via hierarchical feature extraction with SFS-optimized convolution neural
network for invariant human age classification,” in Electronics, vol. 10,
2021.
[80] M. Javeed, M. Gochoo, A. Jalal, and K. Kim, “HF-SPHR: Hybrid features
for sustainable physical healthcare pattern recognition using deep belief
networks”, in Sustainability, 2021.
[81] A. Ahmed, A. Jalal, and A.A. Rafique, “Salient segmentation based object
detection and recognition using hybrid genetic transform”, in Proc. on
ICAEM conference, 2019.
[82] F. Farooq, A. Jalal, and L. Zheng, “Facial expression recognition using
hybrid features and self-organizing maps,” in Proc. on Multimedia and Expo,
2017.
[83] M. Gochoo, I. Akhter, A. Jalal, and K. Kim, “Stochastic remote sensing
event classification over adaptive posture estimation via multifused data and
deep belief network”, in Remote Sensing, 2021.
[84] A. Jalal, A. Ahmed, A. Rafique, and K. Kim “Scene Semantic recognition
based on modified Fuzzy c-mean and maximum entropy using object-to-
object relations,” in IEEE Access, vol. 9, 2021.
[85] A. Jalal, I. Akhtar, and K. Kim, “Human posture estimation and sustainable
events classification via pseudo-2D stick model and k-ary tree hashing, in
Sustainability, 2020.
[86] A. Jalal, N. Khalid, and K. Kim, “Automatic recognition of human
interaction via hybrid descriptors and maximum entropy markov model
using depth sensors,” in Entropy, 2020.
[87] S. J. Berlin and M. John, “Human interaction recognition through deep
learning network,” in Proc. on ICCST, 2016.
[88] P. Lubina and M. Rudzki, “Artificial neural networks in accelerometer-based
human activity recognition,” in proc. on MIXDES, 2015.
[89] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould, “Dynamic
image networks for action recognition,” in Proc. on CVPR, pp. 3034-3042,
2016.
[90] G. Gkioxari, R. Girshick, and J. Malik, “Contextual action recognition with
R* CNN,” in Proc. on CVPR, pp. 1080-1088, 2015.
[91] A. Farzana, S. Abirami, and M. Sirvani, “A frame for captioning the human
interactions,” in Proc. on Advanced Computing, 2019.
[92] H. Fatta and U. Fajar, “Captioning image using convolutional neural network
(CNN) and long-short term memory (LSTM),” in International Seminar on
Research of Information Technology and Intelligent Systems, 2019.
[93] N. Otsu, “A threshold selection method from gray-level histograms”, in IEEE Trans. Sys. Man. Cyber., vol. 9, pp. 62-66, 1979.
[94] K. Banerjee et al., “Exploring Alternatives to Softmax Function,” 2020.
[95] A. Shahroudy, J. Liu, T. Ng, and G. Wang, “NTU RGB+D: A large scale
dataset for 3D human activity analysis,” in Proc. on CVPR, 2016.
[96] J. Liu et al., “NTU RGB+D 120: A large-scale benchmark for 3D human
activity understanding”, in TPAMI, 2019.
[97] C. Coppola, S. Cosar, D.R. Faria, and N. Bellotto. “Automatic detection of
human interactions from RGB-D data for social activity classification, in
Proc. on RO-MAN, Lisbon, Portugal, 2017.
[98] L. Breiman, “Random forests”, in Machine Learning, vol. 45, pp. 5-32, 2001.
[99] S. Zhang, X. Liu, and J. Xiao. “On geometric features for skeleton-based
action recognition using multilayer LSTM networks,” in Proc. on WACV, pp. 148-157, 2017.
[100] I. Lee, D. Kim, S. Kang, and S. Lee, “Ensemble deep learning for skeleton-
based action recognition using temporal sliding LSTM networks,” in Proc.
on ICCV, 2017.
[101] A. Shahroudy, T. Ng, Y. Gong, and G. Wang, “Deep multimodal feature
analysis for action recognition in RGB+D videos,” in TPAMI, vol. 40, pp.
1045-1058, 2018.
[102] J. Lee and B. Ahn, “Real-time human action recognition with a low-cost
RGB camera and mobile robot platform,” in Sensors, vol. 20, pp. 2886,
2020.
[103] M. Liu, H. Liu, and C. Chen, “Enhanced skeleton visualization for view invariant human action recognition,” in Pattern Recognition, vol. 68, pp. 346-362, 2017.
[104] C. Coppola, D.R. Faria, U. Nunes, and N. Bellotto, “Social activity
recognition based on probabilistic merging of skeleton features with
proximity priors from RGB-D data, in Proc. of the IEEE/RSJ IROS, 2016.
[105] M. Ehatisham-Ul-Haq et al., “Robust human activity recognition using
multimodal feature-level fusion, in IEEE Access, vol. 7, pp. 60736-60751,
2019.
... Researchers have faced many challenges while working on the task of recognizing human actions. The very first one is that human actions may vary significantly in terms of pose, appearance, speed, and context [14][15][16]. This makes developing robust algorithms that can accurately recognize actions in different settings quite difficult. ...
Article
Full-text available
Human action recognition is critical because it allows machines to comprehend and interpret human behavior, which has several real-world applications such as video surveillance, robot-human collaboration, sports analysis, and entertainment. The enormous variety in human motion and appearance is one of the most challenging problems in human action recognition. Additionally, when drones are employed for video capture, the complexity of recognition gets enhanced manyfold. The challenges including, the dynamic background, motion blur, occlusions, video capture angle, and exposure issues gets introduced that need to be taken care of. In this article, we proposed a system that deal with the mentioned challenges in drone recorded red-green-blue (RGB) videos. System first splits the video into its constituent frames and then performs a focused smoothing operation on the frames utilizing a bilateral filter. As a result, the foreground objects in the image gets enhanced while the background gets blur. After that a segmentation operation is performed using a quick shift segmentation algorithm that separates out human silhouette from the original video frame. The human skeleton was extracted from the silhouette, and key-points on the skeleton were identified. Thirteen skeleton key-points were extracted, including the head, left wrist, right wrist, left elbow, right elbow, torso, abdomen, right thigh, left thigh, right knee, left knee, right ankle, and left ankle. Using these key-points, we extracted normalized positions, their angular and distance relationship with each other, and 3D point clouds. By implementing an expectation maximization algorithm based on Gaussian mixture model, we drew elliptical clusters over the pixels using the key-points as the central positions to represent the human silhouette. Landmarks were located on the boundaries of these ellipses and were tracked from the beginning until the end of activity. After optimizing the feature matrix using a naïve Bayes feature optimizer, the classification is performed using a deep convolutional neural network. For our experimentation and the validation of our system, three benchmark datasets were utilized i.e., the UAVGesture, the DroneAction, and the UAVHuman dataset. Our model achieved a respective action recognition accuracy of 0.95, 0.90, and 0.44 on the mentioned datasets.
... To improve the design of YOLOv4 and make it more suitable for training on a single GPU, it primarily uses a modified Path Aggregation Network (PANet). The primary function of PANet is to enhance instance segmentation by preserving spatial data, which aids accurate pixel localization for the prediction of the mask [115], [116]. The crucial characteristics that give them their high level of accuracy for mask prediction are Bottom-up Path Augmentation, Fully-Connected Fusion, ...
Article
Full-text available
Road congestion, air pollution, and accident rates have all increased as a result of rising traffic density and worldwide population growth. Over the past recent years, the overall number of automobiles has increased significantly around the world. Therefore, an automated traffic monitoring system is essential for intelligent transportation management and control systems. The conventional traffic surveillance systems are based on local platforms which include the use of induction loops or static cameras mounted on the roadsides, poles, or bridges. These platforms often rely on expensive hardware which makes their implementation costly and also, lack flexibility and portability which constrains their deployment in different situations or areas. Whereas, aerial images can sense the traffic scenes with appropriate resolution over a broader area using mobile platforms. Although, there are many improved traffic monitoring systems have been introduced but still there are some challenges that need to be addressed. In this research, we have developed an efficient system for autonomous traffic monitoring based on aerial images. Moreover, the proposed model also classifies the detected vehicle into multiple vehicle classes. The proposed system involves seven steps. In the first step, all the input aerial images are pre-processed for noise removal and brightness adjustment using the defogging and gamma correction techniques respectively. Then, to separate the foreground and background, we used the segmentation technique. Next, we used the You Look Only Once (YOLO) algorithm for vehicle detection. To estimate the traffic density, we implemented a vehicle counter on each image frame. For vehicle classification, we implemented a Deep Belief Network classifier trained on Scale Invariant Feature Transform (SIFT) features. In the last stage, we used the DeepSORT tracker to track the vehicles across the extracted frames. An approximation of path trajectories followed by tracked vehicles is also performed. We used three publicly available datasets for experimentation. Different experiments have been conducted which shows the effectiveness of our proposed methodology.
... There are many multi-modality-based systems proposed by researchers in recent years [38][39][40][41][42]. In [43], a new deep learning and multi-modal data-based method has been suggested. ...
Conference Paper
Full-text available
Human activities have always been complex and most important concern for researchers especially when it comes to physical exercises. Multiple methods have been proposed for physical exercise recognition using different sensors where the conventional approaches focused on either videos or motion-based sensors. Whereas, the combination of both types of data can improve the physical exercise recognition particularly for complex motion patterns. For that reason, a hybrid hand-crafted cues-based method has been proposed in this paper. Data has been collected from the multi-modality-based datasets that are publicly available. Next, three different filters have been used to sift the noise from multiple sensors-based data. Then, an overlapping windowing technique along with human silhouette extraction has been utilized to pre-process the filtered data. Further, the hybrid hand-crafted cues have been extracted using linear prediction cepstral coefficients, Gaussian markov random field, and saliency maps. Finally, the cues have been reduced using multi-layer sequential forward selection methodology and the physical exercise activities have been classified using a deep belief network.
... However, due to the utilization of CNN-based features, the system was not able to achieve high accuracy rates. In [45], the authors have proposed an RGB and depth video frames-based IoT approach to detect human interactions. First, the key-frames have been extracted and images normalized followed by noise removal and region of interest extraction. ...
Conference Paper
Full-text available
Internet of things (IoT) represent the small devices connected together wirelessly collecting data to make lifestyle convenient. Inertial measurement units (IMU) and cameras connected together to collect data from multiple indoor activities can also support home surveillance systems. The traditional closed-circuit television is out-fashioned due to the huge volume of storage requirements and not connected together to notify users immediately of apprehensive activities. Therefore, this paper proposes an IoT-based surveillance system for indoor environments that will upkeep the security methods inside the home. For this purpose, the fused multi-sensors-based data is acquired from two state-of-the-art datasets, namely, Opportunity++ and CMU-MMAC. This acquired data from IoT devices is further pre-processed through multiple filtering techniques according to the type of data. Then, a skeleton model has been designed for the filtered video frame sequences. Furthermore, a bag of visual and motion features has been extracted using three different techniques followed by their discrimination. Finally, the IoT-based surveillance system detects indoor activities and provides feedback to the user.
... Secondly, estimated depth from RGB images using [9][10][11]. This method has been based on a neural network that is trained using images with its respective depth feature [12][13][14][15]. In third step, used the depth results with image processing filters to get the surface normal [16][17][18][19]. ...
Conference Paper
Full-text available
Creation of 3D models from a single RGB image is challenging problem in image processing these days, as the technology is in its early development stage. However, the demands for 3D technology and 3D reconstruction have been rapidly increasing nowadays. The traditional approach of computer graphics is to create a geometric model in 3D and try to reproduce it onto a 2D image with rendering. The major aim of the study is to create 3D models from 2D RGB image using machine learning techniques to be less computationally complex as compared to any deep learning algorithm. The proposed model has been based on three different modules such as: 2.5D features extraction, mesh generation, and 3D boundary detection. The ShapeNet dataset has been used for comparison. The testing results has shown an accuracy of 90.77 % in the plane class, 85.72% in the chair class, and 72.14% in the automobile class. The proposed model could be applicable to problems where reconstruction of 3D models is required such as: variations in geometric scale, mix of textured, uniformly colored, and reflective surfaces.
... The road mask is a binary image to locate the roads. The images are masked by multiplying the original image with the mask image pixel by pixel [34][35][36][37]. As the value of a black pixel is 0 when it gets multiplied with any other pixel value the resultant will be a 0. Thus, eliminating the irrelevant area. ...
Thesis
Full-text available
With the passage of time, human-computer interaction evolves. The use of traditional remote systems is replaced with hand gestures. Using hand gestures is an excellent way to communicate with others and operate different devices. This is done by training the system using different hand gestures. Recently, many datasets are available for hand gestures recognition model training used for various purposes. In our proposed model, we have trained our system using both machine and deep learning classifiers. We have adapted various pre-processing, hand detection, and feature descriptors methods for the efficient tracking and recognition of hand gestures. Our proposed work is focused on three applications i-e, hand gestures recognition for controlling smart home appliances, for e-learning and for medical specialists to communicate with the patients and to operate different electro-medical devices. We have used two datasets for each field and achieved remarkable accuracy rates against all datasets.
Article
Full-text available
The world’s expanding populace, the variety of human social factors, and the densely populated environment make humans feel uncertain. Individuals need a safety officer who generally deals with security viewpoints for this frailty. Currently, human monitoring techniques are time-consuming, work concentrated, and incapable. Therefore, autonomous surveillance frameworks are necessary for the modern day since they are able to address these problems. Nevertheless, hardships persist. The central concerns incorporate the detachment of the foreground from the scene and the understanding of the contextual structure of the environment for efficiently identifying unusual objects. In our work, we introduced a novel framework to tackle these difficulties by presenting a semantic segmentation technique for separating a foreground object. In our work, Super-pixels are generated using an improved watershed transform and then a conditional random field is implemented to obtain multi-object segmented frames by performing pixel-level labeling. Next, the Social Force model is introduced to extract the contextual structure of the environment via the fusion of a novel chosen particular histogram of an optical stream and inner force model. After using the computed social force, multi-people tracking is performed via three-dimensional template association using percentile rank and non-maximal suppression. Next, multi-object categorization is performed via deep learning Feature Pyramid Network. Finally, by considering the contextual structure of the environment, Jaccard similarity is utilized to make the decision for abnormality detection and identify the unusual objects from the scene. The invented framework is verified through rigorous investigations, and it obtained multi-people tracking efficiency of 92.2% and 89.1% over the UCSD and CUHK Avenue datasets. However, 95.2% and 93.7% abnormality detection efficiency is accomplished over UCSD and CUHK Avenue datasets, respectively.
Article
Full-text available
Over the past few years, automatic recognition of human interactions has drawn significant attention from researchers working in the field of Artificial Intelligence (AI). And feature extraction is one of the most critical tasks in developing efficient Human Interaction Recognition (HIR) systems. Moreover, recent researches in computer vision suggest that robust features lead to higher recognition accuracies. Hence, an improved HIR system has been proposed in this paper that combines 2D and 3D features extracted using machine learning and deep learning techniques. These discriminative features result in accurate classification and help avoid misclassification of similar interactions. Ten keyframes have been extracted from each video to reduce computational complexity. Next, these frames have been preprocessed using image normalization and noise removal techniques. The Region Of Interest (ROI), which contains the two humans involved in the interaction, has been extracted using motion detection. Then, the human silhouettes have been segmented using the GrabCut algorithm. Next, the extracted silhouettes have been converted into 3D meshes and their heat kernel signatures (HKS) have been obtained to extract key body points. A Convolutional Neural Network (CNN) has been used to extract full-body features from 2D full-body silhouettes. Then, topological and geometric features have been extracted from the key body points. Finally, the combined feature vector has been fed into Long Short-Term Memory (LSTM) and each interaction has been recognized using a Softmax classifier. The proposed system has been validated via extensive experimentation on three challenging RGB+D datasets. The recognition accuracies of 91.63%, 90.54%, and 90.13% have been achieved with the SBU Kinect Interaction, NTU RGB+D, and ISR-UoL 3D social activity datasets respectively. The results of extensive experiments performed on the proposed system suggest that it can be used effectively for various applications, such as security, surveillance, health monitoring, and assisted living.
Article
Full-text available
Human-Object Interaction (HOI) recognition, due to its significance in many computer vision-based applications, requires in-depth and meaningful details from image sequences. Incorporating semantics in scene understanding has led to a deep understanding of human-centric actions. Therefore, in this research work, we propose a semantic HOI recognition system based on multi-vision sensors. In the proposed system, the de-noised RGB and depth images, via Bilateral Filtering (BLF), are segmented into multiple clusters using a Simple Linear Iterative Clustering (SLIC) algorithm. The skeleton is then extracted from segmented RGB and depth images via Euclidean Distance Transform (EDT). Human joints, extracted from the skeleton, provide the annotations for accurate pixel-level labeling. An elliptical human model is then generated via a Gaussian Mixture Model (GMM). A Conditional Random Field (CRF) model is trained to allocate a specific label to each pixel of different human body parts and an interaction object. Two semantic feature types that are extracted from each labeled body part of the human and labelled objects are: Fiducial points and 3D point cloud. Features descriptors are quantized using Fisher’s Linear Discriminant Analysis (FLDA) and classified using K-ary Tree Hashing (KATH). In experimentation phase the recognition accuracy achieved with the Sports dataset is 92.88%, with the Sun Yat-Sen University (SYSU) 3D HOI dataset is 93.5% and with the Nanyang Technological University (NTU) RGB+D dataset it is 94.16%. The proposed system is validated via extensive experimentation and should be applicable to many computer-vision based applications such as healthcare monitoring, security systems and assisted living etc.
Article · Full-text available
This work presents the grouping of dependent tasks into clusters using a Bayesian analysis model to solve the affinity scheduling problem in heterogeneous multicore systems. Non-affinity scheduling of tasks has a negative impact because it increases the overall execution time of the tasks. Furthermore, non-affinity-based scheduling also limits the potential for data reuse in the caches, so the same data must be brought into the caches multiple times. In heterogeneous multicore systems, it is essential to address the load balancing problem because all cores operate at varying frequencies. We propose two techniques to solve the load balancing issue: a “chunk-based scheduler” (CBS), which is applied to heterogeneous systems, and “quantum-based intra-core task migration” (QBICTM), in which each task is given a fair and equal chance to run on the fastest core. Results show a 30–55% improvement in the average execution time of the tasks when our CBS or QBICTM scheduler is applied, compared to other traditional schedulers under the same operating system.
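A toy, frequency-aware chunk assignment in the spirit of the chunk-based scheduling idea is sketched below; it simply hands each chunk of tasks to the core expected to finish it earliest. The task costs, core frequencies, and chunk size are all illustrative, and this is not the paper's CBS algorithm.

```python
def chunk_schedule(task_costs, core_freqs, chunk_size=4):
    """Assign fixed-size chunks of tasks to the core that finishes each chunk earliest."""
    finish = [0.0] * len(core_freqs)                      # projected finish time per core
    plan = {core: [] for core in range(len(core_freqs))}
    chunks = [task_costs[i:i + chunk_size]
              for i in range(0, len(task_costs), chunk_size)]
    for chunk in chunks:
        work = sum(chunk)
        # A faster core (higher frequency) completes the same work in less time.
        core = min(range(len(core_freqs)),
                   key=lambda c: finish[c] + work / core_freqs[c])
        finish[core] += work / core_freqs[core]
        plan[core].append(chunk)
    return plan, finish

plan, finish = chunk_schedule(task_costs=[3, 5, 2, 7, 4, 6, 1, 8, 2, 3],
                              core_freqs=[1.0, 1.6, 2.4])  # heterogeneous core speeds
print(plan, finish)
```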
Article · Full-text available
Featured Application: The proposed technique is an application for people detection and counting which has been evaluated on several challenging benchmark datasets. The technique can be applied in heavy-crowd assistance systems that help to find targeted persons, track functional movements, and maximize the performance of surveillance security. Abstract: Automatic head tracking and counting using depth imagery has various practical applications in security, logistics, queue management, space utilization, and visitor counting. However, no currently available system can clearly distinguish between a human head and other objects in order to track and count people accurately. For this reason, we propose a novel system that can track people by monitoring their heads and shoulders in complex environments and also count the number of people entering and exiting the scene. Our system is split into six phases. First, preprocessing is done by converting videos of a scene into frames and removing the background from the video frames. Second, heads are detected using the Hough Circular Gradient Transform, and shoulders are detected by HOG-based symmetry methods. Third, three robust features, namely fused joint HOG-LBP, energy-based point clouds, and fused intra-inter trajectories, are extracted. Fourth, the Apriori association is implemented to select the best features. Fifth, deep learning is used for accurate people tracking. Finally, heads are counted using cross-line judgment. The system was tested on three benchmark datasets, the PCDS dataset, the MICC people counting dataset, and the GOTPD dataset, and counting accuracies of 98.40%, 98%, and 99%, respectively, were achieved. Our system obtained remarkable results.
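The head-detection phase can be approximated with OpenCV's circular Hough transform, as in the hedged sketch below. The depth frame file name and all thresholds are assumptions rather than the paper's configuration.

```python
import cv2
import numpy as np

depth = cv2.imread("depth_frame.png", cv2.IMREAD_GRAYSCALE)  # hypothetical depth frame
depth = cv2.medianBlur(depth, 5)                              # suppress sensor noise
# Circular gradient transform: detect roughly head-sized circles in the frame.
circles = cv2.HoughCircles(depth, cv2.HOUGH_GRADIENT, dp=1.2, minDist=40,
                           param1=80, param2=30, minRadius=8, maxRadius=40)
if circles is not None:
    for x, y, r in np.round(circles[0]).astype(int):
        print(f"head candidate at ({x}, {y}), radius {r}")
```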
Article · Full-text available
Given the rapid increase in demand for people counting and tracking systems for surveillance applications, there is a critical need for more accurate, efficient, and reliable systems. The main goal of this study was to develop an accurate, sustainable, and efficient system capable of error-free counting and tracking in public places, and the major objective was for the system to perform well across different orientations, densities, and backgrounds. We propose an accurate and novel approach consisting of preprocessing, object detection, people verification, particle flow, feature extraction, self-organizing map (SOM)-based clustering, people counting, and people tracking. Initially, filters are applied to preprocess the images and detect objects. Next, random particles are distributed and features are extracted. Subsequently, particle flows are clustered using a self-organizing map, and people counting and tracking are performed based on motion trajectories. Experimental results on the PETS-2009 dataset reveal an accuracy of 86.9% for people counting and 87.5% for people tracking, while experimental results on the TUD-Pedestrian dataset yield 94.2% accuracy for people counting and 94.5% for people tracking. The proposed system is a useful tool for medium-density crowds and can play a vital role in people counting and tracking applications.
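The SOM-based clustering step could look roughly like the sketch below, which uses the third-party minisom package on synthetic particle-flow vectors. The feature layout (dx, dy, speed, angle), map size, and training settings are assumptions, not the authors' setup.

```python
import numpy as np
from minisom import MiniSom                   # pip install minisom

rng = np.random.default_rng(0)
flows = rng.normal(size=(500, 4))             # stand-in (dx, dy, speed, angle) per particle
som = MiniSom(3, 3, input_len=4, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(flows, num_iteration=1000)
# Each particle is assigned to the SOM node that wins for its flow vector.
clusters = [som.winner(f) for f in flows]
print("distinct motion clusters:", len(set(clusters)))
```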
Article · Full-text available
The monitoring of human physical activities using wearable sensors, such as inertial sensors, plays a significant role in various current and potential applications. These applications include physical health tracking, surveillance systems, and robotic assistive technologies. Despite this wide range of applications, the classification and recognition of human activities remain imprecise, which may contribute to unfavorable reactions and responses. To improve the recognition of human activities, we designed a dataset in which ten participants (five male and five female) performed 11 different activities while wearing three body-worn inertial sensors at different locations on the body. Our model extracts features via a hierarchical technique spanning the time, wavelet, and time-frequency domains. Stochastic gradient descent (SGD) is then introduced to optimize the selected features. The selected features with optimized patterns are further processed by a multi-layered kernel sliding perceptron to develop adaptive learning for the classification of physical human activities. Our proposed model was experimentally evaluated on three benchmark datasets: IM-WSHA, a self-annotated dataset; PAMAP2, a dataset comprising daily living activities; and HuGaDB, a dataset containing physical activities of elderly people. The experimental results show that the proposed method achieves better results and outperforms others in terms of recognition accuracy, achieving accuracy rates of 83.18%, 94.16%, and 92.50% on the IM-WSHA, PAMAP2, and HuGaDB datasets, respectively.
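The hierarchical feature extraction described here can be illustrated with simple time-domain statistics plus wavelet sub-band energies for a single inertial channel, as sketched below using NumPy and PyWavelets. The window length, wavelet choice, and synthetic signal are assumptions.

```python
import numpy as np
import pywt                                    # PyWavelets

def window_features(signal):
    """Time-domain statistics plus wavelet sub-band energies for one sensor channel."""
    time_feats = [signal.mean(), signal.std(), signal.min(), signal.max(),
                  float(np.sqrt(np.mean(signal ** 2)))]       # RMS
    coeffs = pywt.wavedec(signal, "db4", level=3)             # 3-level wavelet decomposition
    wavelet_feats = [float(np.sum(c ** 2)) for c in coeffs]   # energy per sub-band
    return np.array(time_feats + wavelet_feats)

rng = np.random.default_rng(1)
accel_x = rng.normal(size=128)                # hypothetical 128-sample accelerometer window
print(window_features(accel_x))
```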
Article · Full-text available
To understand daily events accurately, adaptive pose estimation (APE) systems require a robust context-aware model and optimal feature selection methods. In this paper, we propose a novel gait event detection (GED) system that consists of saliency silhouette detection, a robust body parts model, and a 2D stick model, followed by a hierarchical optimization algorithm. Furthermore, the most prominent context-aware features, such as energy, 0–180° intensity, and distinct moveable features, are proposed by focusing on the invariant and localized characteristics of human postures in different event classes. Finally, we apply Grey Wolf optimization and a genetic algorithm to discriminate complex postures and to assign appropriate labels to each event. In order to evaluate the performance of the proposed GED, two public benchmark datasets, UCF101 and YouTube, are examined via n-fold cross-validation. For the two benchmark datasets, our proposed method detects the human body key points with 82.4% and 83.2% accuracy, respectively. It also extracts the context-aware features and finally recognizes the gait events with 82.6% and 85.0% accuracy, respectively. Compared with other well-known statistical and state-of-the-art methods, the proposed method achieves superior posture detection and recognition accuracy.
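To make the metaheuristic search step concrete, the sketch below implements a compact, generic Grey Wolf Optimization loop in NumPy and tests it on a sphere objective; the objective, bounds, and hyperparameters are placeholders and do not reflect the paper's GED formulation.

```python
import numpy as np

def gwo(objective, dim=5, wolves=20, iters=100, bounds=(-5.0, 5.0), seed=0):
    """Generic Grey Wolf Optimization: the alpha, beta, and delta wolves guide the pack."""
    rng = np.random.default_rng(seed)
    low, high = bounds
    X = rng.uniform(low, high, size=(wolves, dim))
    for t in range(iters):
        fitness = np.apply_along_axis(objective, 1, X)
        alpha, beta, delta = X[np.argsort(fitness)[:3]]        # three best wolves lead
        a = 2.0 - 2.0 * t / iters                              # decreases linearly from 2 to 0
        moves = []
        for leader in (alpha, beta, delta):
            r1, r2 = rng.random(X.shape), rng.random(X.shape)
            A, C = 2 * a * r1 - a, 2 * r2
            moves.append(leader - A * np.abs(C * leader - X))  # encircle the leader
        # Each wolf moves to the average of the three leader-suggested positions.
        X = np.clip(sum(moves) / 3.0, low, high)
    fitness = np.apply_along_axis(objective, 1, X)
    return X[np.argmin(fitness)], float(fitness.min())

best, score = gwo(lambda x: float(np.sum(x ** 2)))             # sphere test objective
print(best, score)
```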