2021 International Conference on Innovative Computing (ICIC)
978-1-6654-0091-6/21/$31.00 ©2021 IEEE
A Novel Deep Learning Model for Understanding
Two-Person Interactions Using Depth Sensors
Manahil Waheed
Dept. of Creative Technologies,
Air University
Islamabad, Pakistan
manahilwaheedar@gmail.com
Madiha Javeed
Dept. of Computer Science,
Air University
Islamabad, Pakistan
191880@students.au.edu.pk
Ahmad Jalal
Dept. of Computer Science
Air University
Islamabad, Pakistan
ahmadjalal@mail.au.edu.pk
Abstract—Despite the ever-increasing efforts made in the field of
data science and artificial intelligence, the task of automatic human
interaction recognition remains challenging. Advanced computer
vision sensors like depth sensors have made it easier to achieve the
goal of accurate recognition of human interactions in complex
situations. The reason for their success is that they are robust against
lighting and illumination variation and are insensitive to color and
texture changes. Therefore, the proposed system combines both RGB
and depth images to train a Convolutional Neural Network (CNN).
The robust features extracted from CNN have been classified using
a Softmax classifier. Two publicly available large RGB-D datasets
have been used for training and evaluating the performance of the
proposed method. The proposed method has achieved an accuracy of
87.03% with the NTU RGB+D dataset and 86.21% with the UoL 3D
Social Interaction dataset.
Keywords—convolutional neural network, deep learning, depth
videos, human interaction recognition, softmax classifier.
I. INTRODUCTION
Human interaction recognition (HIR) refers to the task of
understanding a mutual activity performed by two human beings.
This field has attracted many researchers owing to its wide range
of applications, including security [1-4], smart homes [5-10],
content-based video retrieval [11-14], healthcare [15-20],
surveillance [21-24], and human tracking [25-29]. However, it is
a complex task because of multiple reasons, such as change of
viewpoint, occlusion, variation in clothing and lighting
conditions, low-resolution images, and unavailability of large
datasets. Some progress has been made ever since the introduction
of low-cost depth sensors, such as Microsoft Kinect [30-32],
because they are less affected by lighting conditions than RGB
cameras.
This research proposes a fusion of RGB and depth images to
train a CNN model. The UoL (University of Lincoln) 3-D Social
Interaction dataset provides RGB and depth images. The NTU
RGB+D (Nanyang Technological University's Red Green Blue
and Depth) dataset comprises RGB videos and the corresponding
depth maps. Hence, the RGB videos have been converted into
image frames. To reduce the computational complexity, only 10
keyframes have been selected from each video. The keyframes
have been extracted by comparing the histograms of consecutive
frames. The differences between the histograms of every two
consecutive image frames have been stored in an array and the
frames with the highest differences have been selected as
keyframes afterward. Once keyframes have been extracted from
RGB videos, they have been combined with the corresponding
depth frames. Next, the resulting four-channel (RGB-D) images have been fed to a
CNN model that uses VGG-16 (Visual Geometry Group, 16 layers
deep) [33] as the base model. Finally, a Softmax classifier has
been used for classification.
Similar research work is described in Section II and the
proposed methodology is discussed in Section III. Section IV
presents the implementation details and results of the proposed
method. The conclusion of the research is given in Section V.
II. RELATED WORK
Recent years have seen a lot of progress in the field of human
activity recognition [34-38]. However, identifying interactions
between two human beings is a more challenging task [39]. For
this purpose, many researchers have preferred RGBD data over
RGB data [40-45]. With the availability of this additional depth
information, depth gradients can also be used to extract local
features [46-49]. Moreover, both sensor-based [50-56] and vision-
based [57-60] HIR systems have been developed in the past.
The first step in recognizing human interactions in videos is to
represent events and scenes as image features [61-65]. Based on
those features, an interaction class is assigned to the input video
[66-70]. Another important step during feature extraction is the
identification of key body parts [71-74] and pose estimation [75-
77]. A common approach is to extract hierarchical features
[78,79] from human bodies. Some researchers have also chosen
hybrid features for better classification results [80-83]. For
example, researchers in [84] used a combination of different
blobs, multiple orientations, Fourier transforms, and
geometrical points over the objects as features. A. Jalal et al. [85]
extracted various features, including energy, sine, distinct body
parts movements, and a 3D Cartesian view of smoothing gradients
features. Similarly, a hybrid of four different local descriptors was
used by the authors of [86], i.e., spatio-temporal features, energy-
based features, shape-based angular and geometric features, and a
motion-orthogonal histogram of oriented gradient (MO-HOG).
CNNs have been used extensively for classification purposes
[87-90] and have also proved efficient as feature extractors. For
example, the authors of [91,92] used a CNN as the encoder of their image
captioning systems, with Inception V3 as the base model.
III. METHODOLOGY
This section discusses the proposed methodology for HIR. The
system takes both RGB and depth videos as input. The videos are
first converted into images at the rate of 31 frames per second and
then 10 keyframes are extracted from each video. Pre-processing
is done over the extracted keyframes to enhance the image quality,
making it easier to extract the desired features. These pre-
processed images are then used for human detection and
segmentation. The segmented RGB and depth images are
concatenated and then fed to a CNN model, which extracts
important features from them. These features are then given to the
Softmax classifier that generates the class labels. Fig. 1 shows an
overview of this method.
Fig. 1. A general overview of the proposed architecture.
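The concatenation step of this pipeline can be illustrated with a short NumPy sketch. It assumes that the depth map is appended as a fourth channel after R, G, and B, which the text does not state explicitly. The function name fuse_rgbd is hypothetical.

```python
import numpy as np

def fuse_rgbd(rgb, depth):
    """Stack an H x W x 3 RGB image and an H x W depth map into H x W x 4."""
    if rgb.shape[:2] != depth.shape[:2]:
        raise ValueError("RGB and depth frames must share height and width")
    # Give the depth map an explicit channel axis (assumed layout: R, G, B, D).
    depth = depth.reshape(depth.shape[0], depth.shape[1], 1)
    return np.concatenate([rgb, depth.astype(rgb.dtype)], axis=-1)

# Usage with synthetic frames.
rgb_frame = np.zeros((5, 6, 3), dtype=np.uint8)
depth_frame = np.ones((5, 6), dtype=np.uint8)
rgbd_frame = fuse_rgbd(rgb_frame, depth_frame)
```

The check on height and width reflects the requirement that each RGB keyframe is paired with its corresponding depth frame.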
A. Preprocessing
The NTU RGB+D dataset provides RGB videos and depth
frames. Hence, the RGB videos have been converted into frames
to get the same number of frames against each RGB video as the
depth maps. Since the dataset has 48 videos per class and this
research uses 11 classes, it is computationally very expensive to
keep all the extracted frames. Therefore, only 10 keyframes have
been extracted from each video. The keyframes have been
extracted by computing the differences between the histograms of
every two consecutive frames. The top ten frames corresponding
to the highest differences have been selected. All RGB and depth
images have been cropped to obtain the desired regions. Then they
have been pre-processed using multiple techniques discussed in
detail below. Applying such preprocessing techniques helps
improve the overall accuracy of the system.
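The keyframe selection described above can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes grayscale frames stored as NumPy arrays, an L1 distance between normalized histograms, and that the later frame of each high-difference pair is kept. None of these details are specified in the text, and the function names are illustrative.

```python
import numpy as np

def gray_histogram(frame, bins=256):
    """Normalized intensity histogram of a grayscale uint8 frame."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / frame.size

def select_keyframes(frames, num_keyframes=10):
    """Return sorted indices of the frames that follow the largest histogram changes."""
    hists = [gray_histogram(f) for f in frames]
    # diffs[i] compares frame i with frame i + 1 (L1 distance is an assumption).
    diffs = np.array([np.abs(hists[i + 1] - hists[i]).sum()
                      for i in range(len(hists) - 1)])
    # Take the largest differences; +1 keeps the later frame of each pair.
    top = np.argsort(diffs)[::-1][:num_keyframes] + 1
    return sorted(top.tolist())

# Usage: a synthetic clip whose content changes abruptly at frames 3 and 7.
levels = [10, 10, 10, 200, 200, 200, 200, 50, 50, 50]
clip = [np.full((4, 4), v, dtype=np.uint8) for v in levels]
keyframes = select_keyframes(clip, num_keyframes=2)
```

Sorting the selected indices keeps the keyframes in temporal order, which matters if they are later paired with depth frames by index.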
1) Histogram Equalization
To improve the quality of the images, the contrast is enhanced
by adjusting the intensity values. This is done by the histogram
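equalization technique. The normalized histogram $p_k$
of an image $f$ is given by eq. (1).
$$p_k = \frac{n_k}{N}, \quad k = 0, 1, \ldots, L-1 \tag{1}$$
where $n_k$ is the number of pixels with intensity $k$, $k$ ranges
from 0 to $L-1$ ($L$ is 256), and $N$ is the total number of pixels in
the image. The histogram equalized image $g$ is defined by eq. (2).
$$g(x, y) = \left\lfloor (L-1) \sum_{k=0}^{f(x, y)} p_k \right\rfloor \tag{2}$$
where $f(x, y)$ is the intensity of the input image at pixel $(x, y)$.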
The results of histogram equalization are shown in Fig. 2.
(a) (b) (c) (d)
Fig. 2. Histogram Equalization. (a) original image, (b) histogram of original image,
(c) image after histogram equalization, and (d) histogram of the equalized image.
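A minimal NumPy sketch of histogram equalization as described by eqs. (1) and (2) is given here for illustration. It assumes an 8-bit grayscale image and the standard floor-of-scaled-cumulative-sum mapping. The function name is hypothetical.

```python
import numpy as np

def equalize_histogram(image, levels=256):
    """Histogram-equalize a grayscale uint8 image."""
    counts = np.bincount(image.ravel(), minlength=levels)  # pixels per intensity
    p = counts / image.size                                # normalized histogram, eq. (1)
    cdf = np.cumsum(p)                                     # cumulative sum of the histogram
    mapping = np.floor((levels - 1) * cdf).astype(np.uint8)  # intensity mapping, eq. (2)
    return mapping[image]

# Usage on a tiny 2 x 2 image.
sample = np.array([[0, 0], [128, 255]], dtype=np.uint8)
equalized = equalize_histogram(sample)
```

The mapping is built once per image as a lookup table, so applying it costs a single indexing operation regardless of image size.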
2) Image Smoothing
After histogram equalization, all images have been de-noised
using mean filtering. In this method, each pixel value in the image
has been replaced by the mean of its neighboring pixels, as shown
in eq. (3).
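$$\hat{f}(x, y) = \frac{1}{M} \sum_{(i, j) \in W_{xy}} f(i, j) \tag{3}$$
where $f(i, j)$ are the pixel values at positions $i$ and $j$ inside the window $W_{xy}$ centered on pixel $(x, y)$, and $M$ is the window size,
i.e., the number of neighboring pixels.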
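The mean filter can be sketched in a few lines of NumPy. This sketch assumes a 2-D grayscale image, an odd square window, and edge replication at the image borders. The border handling and the default window size of 3 are assumptions, since the text does not specify them.

```python
import numpy as np

def mean_filter(image, size=3):
    """Replace each pixel with the mean of its size x size neighborhood."""
    pad = size // 2
    # Edge replication at the borders is an assumption made for this sketch.
    padded = np.pad(image.astype(np.float64), pad, mode="edge")
    out = np.zeros(image.shape, dtype=np.float64)
    # Sum the shifted copies of the image that cover the window.
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out / (size * size)

# Usage: a single bright pixel is spread evenly over its neighborhood.
spike = np.zeros((3, 3), dtype=np.uint8)
spike[1, 1] = 9
smoothed = mean_filter(spike)
```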
B. Image Segmentation
Image segmentation reduces the complexity of the image as it
returns only the desired part of the image. Moreover, it makes
sense to remove the redundant background from all the images so
the features of the background, which will be the same for
different classes, do not play a part while determining the
interaction class. In order to segment human beings from RGB
and depth images, two image segmentation techniques have been
used, as discussed in detail below.
1) RGB Image Segmentation
Humans have been segmented from the RGB images using the
edge detection technique. Edges or boundaries are detected based
on discontinuity in the intensity values of the pixels. For this
purpose, all RGB images are first converted into grayscale images
and a binary silhouette is extracted using the detected edges. A
floor detection and removal technique has also been implemented
for the NTU RGB+D dataset where the floor often gets
misclassified as the foreground. Based on the range of its intensity
values, a floor mask has been created, which is then used to remove
the floor. The original RGB pixel values are then restored in the
detected binary silhouettes to get the desired RGB silhouettes. Fig.
3 shows the results of the RGB image segmentation stage.
(a)
(b)
Fig. 3. RGB silhouette segmentation (a) ‘hugging’ interaction (NTU RGB+D)
original image (left), binary silhouette (center), and segmented image (right), (b)
‘shaking hands’ interaction (UoL 3D) original image (left), binary silhouette
(center), and segmented image (right).
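The masking steps of this stage can be illustrated with the sketch below. It does not reproduce the edge detection itself and assumes that a binary silhouette is already available. The grayscale weights (ITU-R BT.601) and the floor intensity range are assumptions chosen for illustration, and all function names are hypothetical.

```python
import numpy as np

def to_grayscale(rgb):
    """Luminance-weighted grayscale conversion (BT.601 weights, an assumption)."""
    return rgb[..., :3].astype(np.float64) @ np.array([0.299, 0.587, 0.114])

def remove_floor(silhouette, gray, floor_range):
    """Clear silhouette pixels whose gray level lies inside floor_range."""
    low, high = floor_range
    floor_mask = (gray >= low) & (gray <= high)
    return silhouette & ~floor_mask

def restore_rgb(rgb, silhouette):
    """Keep the original RGB values inside the silhouette and zero elsewhere."""
    return rgb * silhouette[..., None].astype(rgb.dtype)

# Usage on a 2 x 2 image where the top-left pixel plays the role of the floor.
image = np.array([[[100, 100, 100], [10, 200, 30]],
                  [[0, 0, 0], [50, 60, 70]]], dtype=np.uint8)
binary = np.array([[True, True], [False, True]])
cleaned = remove_floor(binary, to_grayscale(image), floor_range=(90, 110))
segmented = restore_rgb(image, cleaned)
```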
2) Depth Image Segmentation
For depth images, Otsu’s thresholding technique has been used
[93]. The intensity values of the depth images available in the NTU
RGB+D dataset have first been adjusted because the raw images were too dark to be
seen. Then the cropped and intensity-adjusted images have been
segmented using Otsu’s thresholding technique, shown in eq. (4).
Fig. 4 shows the results of the depth image segmentation stage.
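$$\sigma_w^2(t) = w_0(t)\,\sigma_0^2(t) + w_1(t)\,\sigma_1^2(t) \tag{4}$$
where $\sigma_w^2(t)$ is the weighted sum of intra-class variances of the
two classes (foreground and background), $w_0(t)$ and $w_1(t)$ are the class probabilities, $\sigma_0^2(t)$ and $\sigma_1^2(t)$ are the class variances, and $t$ is the threshold
value. The selected threshold is the value of $t$ that minimizes $\sigma_w^2(t)$.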
(a)
(b)
Fig. 4. Depth silhouette segmentation (a) ‘kicking’ interaction (NTU RGB+D)
original intensity-adjusted image (left), binary silhouette (center), and segmented
image (right) (b) ‘help stand up’ interaction (UoL 3D) original image (left), binary
silhouette (center), and segmented image (right).
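The thresholding step can be sketched as an exhaustive search over candidate thresholds that minimizes the weighted intra-class variance. The sketch assumes an 8-bit depth image and that foreground pixels are brighter than the background, which is not stated in the text. The function names are illustrative.

```python
import numpy as np

def otsu_threshold(image, levels=256):
    """Return the threshold t that minimizes the weighted intra-class variance."""
    hist = np.bincount(image.ravel(), minlength=levels).astype(np.float64)
    prob = hist / hist.sum()
    intensities = np.arange(levels)
    best_t, best_var = 0, np.inf
    for t in range(1, levels):
        w0, w1 = prob[:t].sum(), prob[t:].sum()
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so this split is not meaningful
        mu0 = (intensities[:t] * prob[:t]).sum() / w0
        mu1 = (intensities[t:] * prob[t:]).sum() / w1
        var0 = (((intensities[:t] - mu0) ** 2) * prob[:t]).sum() / w0
        var1 = (((intensities[t:] - mu1) ** 2) * prob[t:]).sum() / w1
        within = w0 * var0 + w1 * var1
        if within < best_var:
            best_t, best_var = t, within
    return best_t

def segment_depth(image):
    """Binary silhouette: pixels at or above the threshold count as foreground (assumption)."""
    return image >= otsu_threshold(image)

# Usage on a synthetic image with a dark background and a bright square.
depth = np.full((6, 6), 20, dtype=np.uint8)
depth[2:4, 2:4] = 200
threshold = otsu_threshold(depth)
silhouette = segment_depth(depth)
```

In practice a library routine would normally replace the explicit loop. The loop is kept here so that each term of the criterion is visible.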
C. Feature Extraction via CNN
For extraction of features from images, a Convolutional Neural
Network (CNN) has been used. The transfer learning approach
has been employed, which includes using VGG16 as the base
model and then fine-tuning its weights according to the used
datasets. VGG16 is a CNN model that achieves 92.7% top-5 test accuracy on the
ImageNet dataset, which has 1000 classes. Fig. 5 shows all the
layers in the VGG16 model.
Fig. 5. Different layers of the VGG16 architecture with configurations.
First, the training and testing input images have been passed
through the VGG16 base model, and the resulting feature maps with
dimensions of 7x7x512 have been obtained. Then these feature maps have been
used to train the proposed CNN model, which has three
convolutional layers with 128, 64, and 32 filters respectively. The
convolution layers compute the output of neurons that are
connected to local regions in the input. Convolution is similar to
sliding a filter over an image, computing the dot product of filter
weights and image pixels. Rectified Linear Unit (ReLU) has been
used as the activation function for all three convolution layers. It
simply sets all negative values to zero. Then a batch
normalization layer followed by a flatten layer has been used.
Lastly, a dropout layer with a rate of 0.2 has been used to avoid overfitting.
Fig. 6 shows the layers in the proposed model. Table I shows a
summary of the proposed CNN model.
Fig. 6. A general overview of the layers in our CNN model.
[Figure labels. Fig. 5 (VGG16): Convolution, Max pooling, Flatten, Softmax; feature sizes 7x7x512, 1x1x4096, 1x1x1000. Fig. 6 (proposed model): Conv:128, Conv:64, Conv:32, BatchNorm, Flatten, Dropout: 0.2.]
TABLE I. A BRIEF SUMMARY OF OUR CNN MODEL
| Layer | Output Shape | Parameters |
| --- | --- | --- |
| Conv:128 | (None,7,7,128) | 65664 |
| Conv:64 | (None,7,7,64) | 8256 |
| Conv:32 | (None,7,7,32) | 2080 |
| BatchNorm | (None,7,7,32) | 128 |
| Flatten | (None,1568) | 0 |
| Dropout | (None,1568) | 0 |
| Softmax | (None,11) | 17259 |
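The parameter counts in Table I can be checked with a short calculation. The counts are reproduced exactly when the three convolution layers use 1x1 kernels on the 7x7x512 input and the softmax layer is a dense layer over the 7 x 7 x 32 = 1568 flattened features. The kernel size is an inference from the numbers and is not stated in the text. The helper names are hypothetical.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a kernel x kernel convolution."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(in_features, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return in_features * units + units

param_counts = {
    "Conv:128": conv_params(1, 512, 128),     # 1x1 kernel is an inferred assumption
    "Conv:64": conv_params(1, 128, 64),
    "Conv:32": conv_params(1, 64, 32),
    "BatchNorm": 4 * 32,                      # gamma, beta, moving mean, moving variance
    "Softmax": dense_params(7 * 7 * 32, 11),  # 11 interaction classes
}

for layer, count in param_counts.items():
    print(layer, count)
```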
D. Human Interaction Recognition Using Softmax
After extracting features using CNN, the softmax classifier has
been used to recognize human interactions. The softmax function
is a popular choice for multiclass classification [94]. It normalizes
the scores of all the classes such that the resulting
probabilities sum to 1. The softmax output for each class is computed
using eq. (5).
$$P(i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}} \tag{5}$$
where $P(i)$ is the probability of each class, $i$ is the class, $z_i$ is the score produced by the network for class $i$, and $n$ is
the total number of classes.
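The softmax computation of eq. (5) can be illustrated with a short NumPy sketch. Subtracting the maximum score before exponentiation is a common numerical-stability measure that leaves the result unchanged. It is added here as an assumption and is not taken from the text.

```python
import numpy as np

def softmax(scores):
    """Map a vector of class scores to probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # stability shift, does not change the output
    exp = np.exp(shifted)
    return exp / exp.sum()

# Usage: three example scores; the largest score receives the largest probability.
probabilities = softmax(np.array([2.0, 1.0, 0.1]))
predicted_class = int(np.argmax(probabilities))
```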
IV. EXPERIMENTAL SETUP AND RESULTS
This section gives a brief description of the datasets used for
experimentation, the implementation details, and the results of
different experiments conducted to evaluate the performance of the
proposed HIR model. The results also contain a comparison of the
proposed system’s accuracy with that of other state-of-the-art
systems.
A. Datasets
1) NTU RGB+D dataset
The NTU RGB+D dataset [95,96] consists of 60 classes, 11 of
which are two-person interactions: punching, kicking, pushing, pat
on back, point finger, hugging, giving object, touch pocket,
shaking hands, walking towards, and walking apart. There are 48
videos for each interaction class. Each session has three sets of
videos since each video is recorded from three different
viewpoints. Fig. 7 shows a few sample frames from this dataset.
(a)
(b)
Fig.7. NTU RGB+D dataset. (a) RGB frame for ‘giving object’ (left), RGB frame
for ‘punching’ (center), and RGB frame for ‘pat on back’ (right), (b) depth frame
for ‘giving object’ (left), depth frame for ‘punching’ (center), and depth frame for
‘pat on back’ (right).
2) UoL 3D Social Interaction dataset
The UoL 3D Social Interaction dataset [97] provides RGB+D
videos and skeleton information of 8 interaction classes: shaking
hands, talk, help walk, help stand up, hug, push, fight, and draw
attention. This dataset includes ten sessions, each comprising two
long videos containing all eight interactions. The skeleton tracks
are provided in a text format. Information about 25 skeleton joints
is provided. Fig. 8 shows a few sample frames from this dataset.
(a)
(b)
Fig.8. UoL 3D Social Interaction dataset. (a) RGB frame for ‘hugging’ (left), RGB
frame for ‘shaking hands’ (center), and RGB frame for ‘kicking’ (right), (b) depth
frame for ‘hugging’ (left), depth frame for ‘shaking hands’ (center), and depth
frame for ‘kicking’ (right).
B. Implementation
The proposed CNN model has been developed in Python using
Jupyter Notebook. Python’s deep learning library, Keras, has been
used as it provides the VGG-16 model and different layers for
convolution, batch normalization, flattening, drop out, and
softmax. The proposed model has been trained for 30 epochs. Fig.
9 shows how the model’s accuracy increased and loss decreased
with the increase in the number of epochs.
(a)
(b)
Fig.9. Accuracy and loss graphs: (a) model accuracy increased with increasing
epochs and (b) model loss decreased with increasing epochs.
C. Results
To evaluate the performance of the proposed system, a one-
third training validation test has been used. Tables II and III show
the average accuracies achieved by the proposed model and the
accuracies achieved per interaction class over the NTU RGB+D
and UoL 3D Social Interaction datasets respectively.
TABLE II. RECOGNITION ACCURACIES OF CLASSES OF NTU RGB+D DATASET
| Class | Accuracy (%) | Class | Accuracy (%) | Class | Accuracy (%) |
| --- | --- | --- | --- | --- | --- |
| punching | 82.23 | pat on back | 79.94 | giving object | 80.76 |
| kicking | 94.72 | point finger | 95.15 | touch pocket | 83.05 |
| pushing | 82.02 | hugging | 95.15 | shaking hands | 92.51 |
| walking towards | 89.24 | walking apart | 82.56 | average accuracy | 87.03 |
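As a worked check, the reported average accuracy for the NTU RGB+D dataset is the unweighted mean of the eleven per-class accuracies listed above. The short calculation below uses only those values. Treating the average as unweighted is an assumption that the numbers happen to confirm.

```python
ntu_accuracy = {
    "punching": 82.23, "kicking": 94.72, "pushing": 82.02,
    "pat on back": 79.94, "point finger": 95.15, "hugging": 95.15,
    "giving object": 80.76, "touch pocket": 83.05, "shaking hands": 92.51,
    "walking towards": 89.24, "walking apart": 82.56,
}

# Unweighted mean over the eleven interaction classes.
ntu_average = sum(ntu_accuracy.values()) / len(ntu_accuracy)
print(f"{ntu_average:.2f}")  # 87.03
```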
TABLE III. RECOGNITION ACCURACIES OF CLASSES OF UOL 3D DATASET
| Class | Accuracy (%) | Class | Accuracy (%) | Class | Accuracy (%) |
| --- | --- | --- | --- | --- | --- |
| handshake | 81.12 | help stand up | 83.21 | fight | 86.04 |
| talk | 88.45 | hug | 90.02 | draw attention | 88.21 |
| help walk | 85.32 | push | 87.30 | average accuracy | 86.21 |
Tables IV and V show a comparison of the results of the
proposed system with two state-of-the-art (SOTA) classifiers:
Bayesian and Random Forest [98]. The proposed method also
outperforms some recent state-of-the-art methods as shown in
Tables VI and VII.
TABLE IV. COMPARISON OF THE PROPOSED METHOD WITH SOTA CLASSIFIERS OVER NTU RGB+D DATASET
| Interaction Class | Random Forest (%) | Bayesian (%) | Proposed method (%) |
| --- | --- | --- | --- |
| punching | 47.43 | 96.41 | 82.23 |
| kicking | 91.56 | 72.05 | 94.72 |
| pushing | 86.74 | 92.56 | 82.02 |
| pat on back | 79.48 | 84.36 | 79.94 |
| point finger | 91.56 | 79.49 | 95.15 |
| hugging | 87.74 | 74.62 | 95.15 |
| giving object | 73.07 | 71.54 | 80.76 |
| touch pocket | 72.35 | 76.67 | 83.05 |
| shaking hands | 87.17 | 83.07 | 92.51 |
| walking towards | 88.74 | 70.26 | 89.24 |
| walking apart | 50.24 | 70.26 | 82.56 |
| average accuracy | 77.82 | 79.21 | 87.03 |
TABLE V. COMPARISON OF THE PROPOSED METHOD WITH SOTA CLASSIFIERS OVER UOL 3D DATASET
| Interaction Class | Random Forest (%) | Bayesian (%) | Proposed method (%) |
| --- | --- | --- | --- |
| handshake | 52.43 | 84.41 | 81.12 |
| talk | 85.71 | 82.05 | 88.45 |
| help walk | 82.45 | 72.56 | 85.32 |
| help stand up | 79.48 | 74.36 | 83.21 |
| hug | 85.71 | 79.49 | 90.02 |
| push | 78.34 | 74.62 | 87.30 |
| fight | 77.07 | 81.54 | 86.04 |
| draw attention | 74.35 | 76.67 | 88.21 |
| average accuracy | 76.94 | 78.21 | 86.21 |
TABLE VI. COMPARISON OF THE PROPOSED METHOD WITH SOTA METHODS OVER NTU RGB+D DATASET
| Authors | Methods | Accuracy (%) |
| --- | --- | --- |
| Songyang et al. [99] | geometric features | 70.26 |
| Inwoong et al. [100] | ensemble TS-LSTM v2 | 74.60 |
| Amir et al. [101] | deep multimodal features | 74.9 |
| Junwoo et al. [102] | mobile robot platform | 75.0 |
| Mengyuan et al. [103] | enhanced skeletal visualization | 75.97 |
| proposed model | | 87.03 |
TABLE VII. COMPARISON OF THE PROPOSED METHOD WITH SOTA METHODS OVER UOL 3D DATASET
| Authors | Methods | Accuracy (%) |
| --- | --- | --- |
| Claudio et al. [104] | probabilistic merging of skeletal features | 85.1 |
| Muhammad et al. [105] | multimodal feature level fusion | 85.12 |
| Claudio et al. [97] | statistical and geometrical features | 85.56 |
| proposed model | | 86.21 |
V. CONCLUSION
In this paper, an HIR system has been proposed that efficiently
recognizes complex human-to-human interactions using both
RGB and depth information. The performed experiments have
shown that RGB-D images give better results than RGB images.
Furthermore, using only 10 keyframes instead of entire videos
reduces the model training time.
As future work, the researchers plan to explore new and better
ways of fusing RGB and depth images for a more efficient
classification system. It is also intended to train and evaluate the
proposed system on more challenging datasets.
REFERENCES
[1] O. Aran and D. Gatica-Perez, “One of a kind: Inferring personality
impressions in meetings,” in Proc. on ICMI (ACM), pp. 11–18, 2013.
[2] A. Jalal, S. Kamal, and D. Kim, “Depth map-based human activity tracking
and recognition using body joints features and self-organized map,” in Proc.
on CCNT, pp. 1-6, 2014.
[3] A. Jalal and Y. Kim, “Dense depth maps-based human pose tracking and
recognition in dynamic scenes using ridge data,” in Proc. on Advanced
Video and Signal-based Surveillance, pp. 119-124, 2014.
[4] A. Jalal, S. Kamal, and D. Kim, “Shape and motion features approach for
activity tracking and recognition from Kinect video camera,” in Proc. on
Advanced Information Networking and Applications Workshops, pp. 445-
450, 2015.
[5] A. Jalal, N. Sharif, J.T. Kim, and T.S. Kim, “Human activity recognition via
recognized body parts of human depth silhouettes for residents monitoring
services at smart homes,” in Indoor and Built Environment, vol. 22, pp. 271-
279, 2013.
[6] A. Jalal, M.A.K. Quaid, and M.A. Sidduqi, “A triaxial acceleration-based
human motion detection for ambient smart home system,” in Proc. on
Applied Sciences and Technology, 2019.
[7] A. Jalal, S. Lee, J. Kim, and T. Kim, “Human activity recognition via the
features of labeled depth body parts,” in Proc. on Smart Homes Health
Telematics, pp. 246-249, 2012.
[8] A. Jalal, J.T. Kim, and T.S Kim, “Development of a life logging system via
depth imaging-based human activity recognition for smart homes,” in Proc.
on Sustainable Healthy Buildings, pp. 91-95, 2012.
[9] A. Jalal, S. Kamal, and D. Kim, “A depth video sensor-based life-logging
human activity recognition system for elderly care in smart indoor
environments,” in Sensors, vol. 14, pp. 11735-11759, 2014.
[10] T. Kim, A. Jalal, H. Han, H. Jeon, and J. Kim, “Real-time life logging via
depth imaging-based human activity recognition towards smart homes
services,” in Proc. on Renewable Energy Sources and Healthy Buildings, pp.
63, 2013.
[11] G.H. Liu, J.Y. Yang, and Z. Li, “Content-based image retrieval using
computational visual attention model,” in Pattern Recognition, vol. 48, pp.
2554–2566, 2015.
[12] S. Sempena, N.U. Maulidevi, and P.R. Aryan, “Human action recognition
using dynamic time warping,” in Proc. on ICEEI, pp. 1–5, 2011.
[13] A. Jalal, S. Kamal, and D. Kim, “Facial expression recognition using 1D
transform features and hidden markov model,” in Journal of Electrical
Engineering & Technology, vol. 12, pp. 1657-1662, 2017.
[14] M. Mahmood, A. Jalal, and H. A. Evans, “Facial expression recognition in
image sequences using 1D transform and gabor wavelet transform,” in Proc.
on Applied and Engineering Mathematics, 2018.
[15] A. Jalal, M. Batool, and K. Kim, “Stochastic recognition of physical activity
and healthcare using tri-axial inertial wearable sensors,” in Applied
Sciences, 2020.
[16] A. Jalal, M.A.K. Quaid, S.B. Tahir, and K. Kim, “A study of accelerometer
and gyroscope measurements in physical life-log activities detection
systems,” in Sensors, 2020.
[17] A. Jalal, M. Batool, and K. Kim, “Sustainable wearable system: human
behavior modeling for life-logging activities using k-ary tree hashing
classifier,” in Sustainability, 2020.
[18] M. Javeed, A. Jalal, and K. Kim, “Wearable sensors based exertion
recognition using statistical features and random forest for physical
healthcare monitoring,” in Proc. on Applied Sciences and Technology, 2021.
[19] A. Jalal, M. Batool and B. Tahir, “Markerless sensors for physical health
monitoring system using ECG and GMM feature extraction,” in Proc. on
IBCAST, 2021.
[20] A. Jalal, M.A.K. Quaid, and A.S. Hasan, “Wearable sensor-based human
behavior understanding and recognition in daily life for smart
environments,” in Proc. on Frontiers of Information Technology, 2018.
[21] A. Shehzad, A. Jalal, and K. Kim, “Multi-person tracking in smart
surveillance system for crowd counting and normal/abnormal events
detection,” in Proc. on Applied and Engineering Mathematics, 2019.
[22] P. Mahwish, G. Yazeed, M. Gochoo, A. Jalal, S. Kamal, and D. Kim, “A
smart surveillance system for people counting and tracking using particle
flow and modified SOM,” in Sustainability, 2021.
[23] P. Mahwish, A. Jalal, and K. Kim, “Hybrid algorithm for multi people
counting and tracking for smart surveillance,” in Proc. on IBCAST, 2021.
[24] N. Khalid, M. Gochoo, A. Jalal, and K. Kim, “Modeling two-person
segmentation and locomotion for stereoscopic action identification: a
sustainable video surveillance system,” in Sustainability, 2021.
[25] A. Jalal, Y. Kim, S. Kamal, A. Farooq, and D. Kim, “Human daily activity
recognition with joints plus body features representation using Kinect
sensor,” in Proc. on Informatics, Electronics, and Vision, 2015.
[26] A. Jalal, S. Kamal, A. Farooq, and D. Kim, “A spatiotemporal motion
variation features extraction approach for human tracking and pose-based
action recognition,” in Proc. on Informatics, Electronics, and Vision, 2015.
[27] A. Nadeem, A. Jalal, and K. Kim, “Human actions tracking and recognition
based on body parts detection via artificial neural network,” in Proc. on
Advancements in Computational Sciences, 2020.
[28] S. Kamal, A. Jalal, and D. Kim, “Depth images-based human detection,
tracking and activity recognition using spatiotemporal features and Modified
HMM,” in Journal of Electrical Engineering and Technology, pp. 1921-1926,
2016.
[29] A. Jalal, M. Mahmood, and A. S. Hasan, “Multi-features descriptors for
human activity tracking and recognition in Indoor-outdoor environments,”
in Proc. on Applied Sciences and Technology, 2019.
[30] M. Asadi-Aghbolaghi, et al., “A survey on deep learning based approaches
for action and gesture recognition in image sequences,” in Proc. on
Automatic Face & Gesture Recognition, 2017.
[31] A. Jalal, S. Kamal, and D. Kim, “Human depth sensors-based activity
recognition using spatiotemporal features and hidden markov model for
smart environments,” in Journal of Computer Networks and
Communications, vol. 2016, pp. 1-11, 2016.
[32] A. Jalal, S. Kamal, and D. Kim, “Depth Silhouettes Context: A new robust
feature for human tracking and activity recognition based on embedded
HMMs,” in Proc. on Ubiquitous Robots and Ambient Intelligence, pp. 294-
299, 2015.
[33] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” in Proc. on Learning Representations, 2015.
[34] A. Jalal, J.T. Kim, and T.S. Kim, “Human activity recognition using the
labeled depth body parts information of depth silhouettes,” in Proc. on
Sustainable Healthy Buildings, pp. 1-8, 2012.
[35] A. Jalal, M.Z. Uddin, and T.S. Kim, “Depth video-based human activity
recognition system using translation and scaling invariant features for life
logging at smart home,” in IEEE Transaction on Consumer Electronics, vol.
58, pp. 863-871, 2012.
[36] A. Jalal, M.Z. Uddin, J.T. Kim, and T.S. Kim, “Daily human activity
recognition using depth Silhouettes and R transformation for smart home,”
in Proc. on Smart Homes Health Telematics, pp. 25-32, 2011.
[37] S. Badar, A. Jalal, and M. Batool, “Wearable Sensors for activity analysis
using SMO-based random forest over smart home and sports datasets,” in
Proc. on ICACS, 2020.
[38] S. Badar, A. Jalal, and K. Kim, “Wearable inertial sensors for daily activity
analysis based on Adam optimization and the maximum entropy markov
model,” in Entropy, vol. 22, pp. 1-19, 2020.
[39] A. Stergiou and R. Poppe, “Understanding human-human interactions: a
survey,” in Computer Vision and Image Understanding, 2019.
[40] A. Jalal, Y. Kim, and D. Kim, “Ridge body parts features for human pose
estimation and recognition from RGB-D video data,” in Proc. on Computing,
Communication and Networking Technologies, pp. 1-6, 2014.
[41] M. Mahmood, A. Jalal, and K. Kim, “WHITE STAG model: Wise human
interaction tracking and estimation (WHITE) using spatio-temporal and
angular-geometric (STAG) Descriptors,” in Multimedia Tools and
Applications, 2020.
[42] A. Farooq, A. Jalal, and S. Kamal, “Dense RGB-D map-based human
tracking and activity recognition using skin joints features and self-
organizing map,” in KSII Transactions on internet and information systems,
vol. 9, pp. 1856-1869, 2015.
[43] A. Ahmed, A. Jalal, and K. Kim, “RGB-D images for object segmentation,
localization and recognition in indoor scenes using feature descriptor and
Hough voting,” in Proc. on Applied Sciences and Technology, 2020.
[44] M. Gochoo, S.R. Amna, G. Yazeed, A. Jalal, S. Kamal, and D. Kim, “A
systematic deep learning based overhead tracking and counting system using
RGB-D remote cameras,” in Applied Sciences, 2021.
[45] M.A.K. Quaid and A. Jalal, “Wearable sensors based human behavioral
pattern recognition using statistical features and reweighted genetic
Algorithm,” in Multimedia Tools and Applications, 2019.
[46] A. Ahmed, A. Jalal, and K. Kim, “A novel statistical method for scene
classification based on multi-object categorization and logistic regression,”
in Sensors, 2020.
[47] S. Li, W. Zhang, and A.B. Chan, “Maximum-margin structured learning with
deep networks for 3D human pose estimation,” in Proc. on ICCV, pp. 2848–
2856, 2015.
[48] A. Jalal, S. Kamal, and D. Kim, “Individual Detection-Tracking-Recognition
using depth activity images,” in Proc. on Ubiquitous Robots and Ambient
Intelligence, pp. 450-455, 2015.
[49] S. Kamal and A. Jalal, “A hybrid feature extraction approach for human
detection, tracking and activity recognition using depth sensors,” in Arabian
Journal for Science and Engineering, vol. 41, pp. 1043-1051, 2016.
[50] M. Batool, A. Jalal, and K. Kim, “Sensors technologies for human activity
analysis based on SVM optimized by PSO algorithm,” in Proc. on ICAEM,
2019.
[51] B. Tahir, A. Jalal, and K. Kim, “IMU sensor based automatic-features
descriptor for healthcare patient’s daily life-log recognition,” in Proc. on
Applied Sciences and Technology, 2021.
[52] M. Gochoo, S. Badar, A. Jalal, and K. Kim, “Monitoring real-time personal
locomotion behaviors over smart indoor-outdoor environments via body-
worn sensors,” in IEEE Access, 2021.
[53] U. Azmat and A. Jalal, “Smartphone inertial sensors for human locomotion
activity recognition based on template matching and codebook generation,”
in Proc. on Communication Technologies, 2021.
[54] A. Jalal, M.A.K. Quaid, and K. Kim, “A Wrist worn acceleration based
human motion analysis and classification for ambient smart home System,”
in Journal of Electrical Engineering & Technology, 2019.
[55] M. Batool, A. Jalal, and K. Kim, “Telemonitoring of daily activity using
accelerometer and gyroscope in smart home environments,” in Journal of
Electrical Engineering and Technology, 2020.
[56] A. Jalal, M.A.K. Quaid, S.B. Tahir, and K. Kim, “A study of accelerometer
and gyroscope measurements in physical life-log activities detection
systems,” in Sensors, 2020.
[57] A. Jalal, Y.H. Kim, Y.J. Kim, S. Kamal, and D. Kim, “Robust human activity
recognition from depth video using spatiotemporal multi-fused features,” in
Pattern recognition, vol. 61, pp. 295-308, 2017.
[58] A. Jalal and S. Kamal, “Improved behavior monitoring and classification
using cues parameters extraction from camera array images,” in International
Journal of Interactive multimedia and Artificial Intelligence, vol. 5, 2018.
[59] K. Kim, A. Jalal, and M. Mahmood, “Vision-based human activity
recognition system using depth silhouettes: A smart home system for
monitoring the residents,” in Journal of Electrical Engineering and
Technology, 2019.
[60] A. Jalal, S. Kamal, and D. Kim, “A depth video-based human detection and
activity recognition using multi-features and embedded hidden Markov
models for health care monitoring systems,” in International Journal of
Interactive multimedia and Artificial Intelligence, vol. 4, pp. 54-62, 2017.
[61] A. Ahmed, A. Jalal, and K. Kim, “Multi‑objects detection and segmentation
for scene understanding based on Texton forest and kernel sliding
perceptron,” in Journal of Electrical Engineering and Technology, 2020.
[62] I. Akhter, A. Jalal, and K. Kim, “Pose estimation and detection for event
recognition using Sense-Aware features and Adaboost classifier,” in Proc.
on IBCAST, 2021.
[63] A.A. Rafique, A. Jalal, and A. Ahmed, “Scene understanding and
recognition: statistical segmented model using geometrical features and
Gaussian naïve bayes,” in Proc. on Applied and Engineering Mathematics,
2019.
[64] A. Ahmed, A. Jalal, and K. Kim, “Region and decision tree-based
segmentations for multi-objects detection and classification in outdoor
scenes,” in Proc. on Frontiers of Information Technology, 2019.
[65] A.A. Rafique, A. Jalal, and K. Kim, “Statistical multi-objects segmentation
for indoor/outdoor scene detection and classification via depth images,” in
Proc. on Applied Sciences and Technology, 2020.
[66] A. Jalal, M. Mahmood, and M.A. Sidduqi, “Robust spatio-temporal features
for human interaction recognition via artificial neural network,” in Proc. on
Frontiers of Information Technology, 2018.
[67] A. Jalal, S. Kamal, and C. Cecer, “Depth maps-based human segmentation
and action recognition using full-body plus body color cues via recognizer
engine,” in Journal of Electrical Engineering & Technology, 2018.
[68] A. Jalal and M. Mahmood, “Students’ behavior mining in e-learning
environment using cognitive processes with information technologies,” in
Education and Information Technologies Springer, 2019.
[69] A. Jalal and S. Kim, “Algorithmic implementation and efficiency
maintenance of real-time environment using low-bitrate wireless
communication,” in Proc. on Software Technologies for Future Embedded
and Ubiquitous Systems, 2006.
[70] S. Abbasi, S. Kamal, M. Gochoo, A. Jalal, and D. Kim, “Affinity-based task
scheduling on heterogeneous multicore systems using CBS and QBICTM,”
in Applied Sciences, 2021.
[71] K. Nida, G. Y. Yazeed, M. Gochoo, A. Jalal, and K. Kim, “Semantic
recognition of human-object interactions via Gaussian-based elliptical
modelling and pixel-level labeling,” in IEEE Access, 2021.
[72] H. Ansar, A. Jalal, M. Gochoo, and K. Kim, “Hand gesture recognition based
on auto‐landmark localization and reweighted genetic algorithm for
healthcare muscle activities,” in Sustainability, 2021.
[73] A. Jalal, A. Nadeem, and S. Bobasu, “Human body parts estimation and
detection for physical sports movements,” in Proc. on Communication,
Computing, and Digital Systems, 2019.
[74] S. Amna, A. Jalal, and K. Kim, “An Accurate Facial expression detector
using multi-landmarks selection and local transform features,” in Proc. on
IEEE conference, 2020.
[75] A. Jalal, S. Kamal, and D.S. Kim, “Detecting complex 3D human motions
with body model low-rank representation for real-time smart activity
monitoring system,” KSII Transactions on Internet and Information
Systems, vol. 12, pp. 1189-1204, 2018.
[76] N. Amir, A. Jalal, and K. Kim, “Automatic human posture estimation for
sport activity recognition with robust body parts detection and entropy
markov model,” in Multimedia Tools and Applications, 2021.
[77] A. Rafique, A. Jalal, and K. Kim, “Automated sustainable multi-object
segmentation and recognition via modified sampling consensus and kernel
sliding perceptron,” in Symmetry, 2020.
[78] I. Akhter, A. Jalal, and K. Kim, “Adaptive pose estimation for gait event
detection using context‑aware model and hierarchical optimization,” Journal
of Electrical Engineering and Technology, 2021.
[79] S.A. Rizwan, A. Jalal, M. Gochoo, and K. Kim, “Robust active shape model
via hierarchical feature extraction with SFS-optimized convolution neural
network for invariant human age classification,” in Electronics, vol. 10,
2021.
[80] M. Javeed, M. Gochoo, A. Jalal, and K. Kim, “HF-SPHR: Hybrid features
for sustainable physical healthcare pattern recognition using deep belief
networks,” in Sustainability, 2021.
[81] A. Ahmed, A. Jalal, and A.A. Rafique, “Salient segmentation based object
detection and recognition using hybrid genetic transform,” in Proc. on
ICAEM conference, 2019.
[82] F. Farooq, A. Jalal, and L. Zheng, “Facial expression recognition using
hybrid features and self-organizing maps,” in Proc. on Multimedia and Expo,
2017.
[83] M. Gochoo, I. Akhter, A. Jalal, and K. Kim, “Stochastic remote sensing
event classification over adaptive posture estimation via multifused data and
deep belief network,” in Remote Sensing, 2021.
[84] A. Jalal, A. Ahmed, A. Rafique, and K. Kim, “Scene Semantic recognition
based on modified Fuzzy c-mean and maximum entropy using object-to-
object relations,” in IEEE Access, vol. 9, 2021.
[85] A. Jalal, I. Akhtar, and K. Kim, “Human posture estimation and sustainable
events classification via pseudo-2D stick model and k-ary tree hashing,” in
Sustainability, 2020.
[86] A. Jalal, N. Khalid, and K. Kim, “Automatic recognition of human
interaction via hybrid descriptors and maximum entropy markov model
using depth sensors,” in Entropy, 2020.
[87] S. J. Berlin and M. John, “Human interaction recognition through deep
learning network,” in Proc. on ICCST, 2016.
[88] P. Lubina and M. Rudzki, “Artificial neural networks in accelerometer-based
human activity recognition,” in Proc. on MIXDES, 2015.
[89] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould, “Dynamic
image networks for action recognition,” in Proc. on CVPR, pp. 3034–3042,
2016.
[90] G. Gkioxari, R. Girshick, and J. Malik, “Contextual action recognition with
R*CNN,” in Proc. on CVPR, pp. 1080–1088, 2015.
[91] A. Farzana, S. Abirami, and M. Sirvani, “A frame for captioning the human
interactions,” in Proc. on Advanced Computing, 2019.
[92] H. Fatta and U. Fajar, “Captioning image using convolutional neural network
(CNN) and long-short term memory (LSTM),” in International Seminar on
Research of Information Technology and Intelligent Systems, 2019.
[93] N. Otsu, “A threshold selection method from gray-level histograms,” in
IEEE Trans. Sys. Man. Cyber, vol. 9, pp. 62–66, 1979.
[94] K. Banerjee et al., “Exploring Alternatives to Softmax Function,” 2020.
[95] A. Shahroudy, J. Liu, T. Ng, and G. Wang, “NTU RGB+D: A large scale
dataset for 3D human activity analysis,” in Proc. on CVPR, 2016.
[96] J. Liu et al., “NTU RGB+D 120: A large-scale benchmark for 3D human
activity understanding,” in TPAMI, 2019.
[97] C. Coppola, S. Cosar, D.R. Faria, and N. Bellotto, “Automatic detection of
human interactions from RGB-D data for social activity classification,” in
Proc. on RO-MAN, Lisbon, Portugal, 2017.
[98] L. Breiman, “Random forests,” in Machine Learning, vol. 45, pp. 5–32,
2001.
[99] S. Zhang, X. Liu, and J. Xiao, “On geometric features for skeleton-based
action recognition using multilayer LSTM networks,” in Proc. on WACV, pp.
148–157, 2017.
[100] I. Lee, D. Kim, S. Kang, and S. Lee, “Ensemble deep learning for skeleton-
based action recognition using temporal sliding LSTM networks,” in Proc.
on ICCV, 2017.
[101] A. Shahroudy, T. Ng, Y. Gong, and G. Wang, “Deep multimodal feature
analysis for action recognition in RGB+D videos,” in TPAMI, vol. 40, pp.
1045–1058, 2018.
[102] J. Lee and B. Ahn, “Real-time human action recognition with a low-cost
RGB camera and mobile robot platform,” in Sensors, vol. 20, pp. 2886,
2020.
[103] M. Liu, H. Liu, and C. Chen, “Enhanced skeleton visualization for view
invariant human action recognition,” in Pattern Recognition, vol. 68, pp.
346–362, 2017.
[104] C. Coppola, D.R. Faria, U. Nunes, and N. Bellotto, “Social activity
recognition based on probabilistic merging of skeleton features with
proximity priors from RGB-D data,” in Proc. of the IEEE/RSJ IROS, 2016.
[105] M. Ehatisham-Ul-Haq et al., “Robust human activity recognition using
multimodal feature-level fusion,” in IEEE Access, vol. 7, pp. 60736-60751,
2019.