A Novel Deep Learning Model for Understanding
Two-Person Interactions Using Depth Sensors
Manahil Waheed, Dept. of Creative Technologies, Air University, Islamabad, Pakistan (manahilwaheedar@gmail.com)
Madiha Javeed, Dept. of Computer Science, Air University, Islamabad, Pakistan (191880@students.au.edu.pk)
Ahmad Jalal, Dept. of Computer Science, Air University, Islamabad, Pakistan (ahmadjalal@mail.au.edu.pk)
Abstract: Despite the ever-increasing efforts made in the field of
data science and artificial intelligence, the task of automatic human
interaction recognition remains challenging. Advanced computer
vision sensors like depth sensors have made it easier to achieve the
goal of accurate recognition of human interactions in complex
situations. The reason for their success is that they are robust against
lighting and illumination variation and are insensitive to color and
texture changes. Therefore, the proposed system combines both RGB
and depth images to train a Convolutional Neural Network (CNN).
The robust features extracted from CNN have been classified using
a Softmax classifier. Two publicly available large RGB-D datasets
have been used for training and evaluating the performance of the
proposed method. The proposed method has achieved an accuracy of
87.03% with the NTU RGB+D dataset and 86.21% with the UoL 3D
Social Interaction dataset.
Keywords: convolutional neural network, deep learning, depth
videos, human interaction recognition, softmax classifier.
I. INTRODUCTION
Human interaction recognition (HIR) refers to the task of
understanding a mutual activity performed by two human beings.
This field has attracted many researchers owing to its wide range
of applications, including security [1-4], smart homes [5-10],
content-based video retrieval [11-14], healthcare [15-20],
surveillance [21-24], and human tracking [25-29]. However, it is
a complex task because of multiple reasons, such as change of
viewpoint, occlusion, variation in clothing and lighting
conditions, low-resolution images, and unavailability of large
datasets. Some progress has been made ever since the introduction
of low-cost depth sensors, such as Microsoft Kinect [30-32],
because they are not as affected by lighting conditions as RGB
cameras.
This research proposes a fusion of RGB and depth images to
train a CNN model. The UoL (University of Lincoln) 3-D Social
Interaction dataset provides RGB and depth images. The NTU
RGB+D (Nanyang Technological University's Red Blue Green
and Depth) dataset comprises RGB videos and the corresponding
depth maps. Hence, the RGB videos have been converted into
image frames. To reduce the computational complexity, only 10
keyframes have been selected from each video. The keyframes
have been extracted by comparing the histograms of consecutive
frames. The differences between the histograms of every two
consecutive image frames have been stored in an array and the
frames with the highest differences have been selected as
keyframes afterward. Once keyframes have been extracted from
RGB videos, they have been combined with the corresponding
depth frames. Next, the 4-dimensional images have been fed to a
CNN model that uses VGG-16 (Visual Geometry Group-16 layers
deep) [33] as the base model. Finally, a Softmax classifier has
been proposed for classification.
Similar research work is described in Section II and the
proposed methodology is discussed in Section III. Section IV
presents the implementation details and results of the proposed
method. The conclusion of the research is given in Section V.
II. RELATED WORK
Recent years have seen a lot of progress in the field of human
activity recognition [34-38]. However, identifying interactions
between two human beings is a more challenging task [39]. For
this purpose, many researchers have preferred RGBD data over
RGB data [40-45]. With the availability of this additional depth
information, depth gradients can also be used to extract local
features [46-49]. Moreover, both sensor-based [50-56] and vision-
based [57-60] HIR systems have been developed in the past.
The first step in recognizing human interactions in videos is to
represent events and scenes as image features [61-65]. Based on
those features, an interaction class is assigned to the input video
[66-70]. Another important step during feature extraction is the
identification of key body parts [71-74] and pose estimation [75-
77]. A common approach is to extract hierarchical features
[78,79] from human bodies. Some researchers have also chosen
hybrid features for better classification results [80-83]. For
example, researchers in [84] used a combination of different
blobs, multiple orientations, Fourier transforms, and
geometrical points over the objects as features. A. Jalal et al. [85]
extracted various features, including energy, sine, distinct body
parts movements, and a 3D Cartesian view of smoothing gradients
features. Similarly, a hybrid of four different local descriptors was
used by the authors of [86], i.e., spatio-temporal features, energy-
based features, shape-based angular and geometric features, and a
motion-orthogonal histogram of oriented gradient (MO-HOG).
CNNs have been used extensively for classification purposes [87-90] and have also proved effective as feature extractors. The authors of [91,92] used a CNN as the encoder in their image captioning systems, with Inception V3 as the base model.
III. METHODOLOGY
This section discusses the proposed methodology for HIR. The
system takes both RGB and depth videos as input. The videos are
first converted into images at the rate of 31 frames per second and
then 10 keyframes are extracted from each video. Pre-processing
is done over the extracted keyframes to enhance the image quality,
making it easier to extract the desired features. These pre-
processed images are then used for human detection and
segmentation. The segmented RGB and depth images are
concatenated and then fed to a CNN model, which extracts
important features from them. These features are then given to the
Softmax classifier that generates the class labels. Fig. 1 shows an
overview of this method.
Fig. 1. A general overview of the proposed architecture.
A. Preprocessing
The NTU RGB+D dataset provides RGB videos and depth
frames. Hence, the RGB videos have been converted into frames
to get the same number of frames against each RGB video as the
depth maps. Since the dataset has 48 videos per class and this
research uses 11 classes, it is computationally very expensive to
keep all the extracted frames. Therefore, only 10 keyframes have
been extracted from each video. The keyframes have been
extracted by computing the differences between the histograms of
every two consecutive frames. The top ten frames corresponding
to the highest differences have been selected. All RGB and depth
images have been cropped to obtain the desired regions. Then they
have been pre-processed using multiple techniques discussed in
detail below. Applying such preprocessing techniques helps
improve the overall accuracy of the system.
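As an illustration, the following Python sketch shows one way the histogram-difference keyframe selection described above could be implemented with OpenCV. The function name, the use of a grayscale histogram, and the L1 distance between histograms are assumptions for the sketch; the paper does not specify these details.

```python
import cv2
import numpy as np

def select_keyframes(video_path, num_keyframes=10):
    """Pick the frames whose histograms differ most from the previous frame."""
    cap = cv2.VideoCapture(video_path)
    frames, hist_diffs = [], []
    prev_hist = None
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frames.append(frame)
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [256], [0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            # Difference between consecutive histograms (L1 distance assumed here)
            hist_diffs.append(np.sum(np.abs(hist - prev_hist)))
        else:
            hist_diffs.append(0.0)
        prev_hist = hist
    cap.release()
    # Indices of the frames with the largest histogram differences
    top_idx = np.argsort(hist_diffs)[-num_keyframes:]
    return [frames[i] for i in sorted(top_idx)]
```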
1) Histogram Equalization:
To improve the quality of the images, the contrast is enhanced
by adjusting the intensity values. This is done by the histogram
equalization technique. The normalized histogram $p(r_k)$ of an image is given by eq. (1):

$$p(r_k) = \frac{n_k}{n}, \qquad k = 0, 1, \dots, L-1 \qquad (1)$$

where $n_k$ is the number of pixels with intensity $r_k$, $k$ ranges from $0$ to $L-1$ ($L$ is 256), and $n$ is the total number of pixels in the image. The histogram-equalized image is defined by eq. (2):

$$s_k = T(r_k) = (L-1)\sum_{j=0}^{k} p(r_j) \qquad (2)$$
The results of histogram equalization are shown in Fig. 2.
Fig. 2. Histogram Equalization. (a) original image, (b) histogram of original image,
(c) image after histogram equalization, and (d) histogram of the equalized image.
2) Image Smoothing
After histogram equalization, all images have been de-noised
using mean filtering. In this method, each pixel value in the image
has been replaced by the mean of its neighboring pixels, as shown
in eq. (3).

$$g(i,j) = \frac{1}{M}\sum_{(x,y)\in N(i,j)} f(x,y) \qquad (3)$$

where $(i,j)$ denotes a pixel location, $N(i,j)$ is the neighborhood window centered at that pixel, $f$ and $g$ are the input and smoothed images, and $M$ is the window size, i.e., the number of neighboring pixels.
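A minimal OpenCV sketch of the two preprocessing steps described above (histogram equalization followed by mean filtering) might look as follows; the 3x3 window size is an assumed value, since the paper does not report the exact neighborhood used.

```python
import cv2

def preprocess(gray_image, window=3):
    """Enhance contrast, then smooth with a mean (averaging) filter."""
    equalized = cv2.equalizeHist(gray_image)          # eq. (1)-(2)
    smoothed = cv2.blur(equalized, (window, window))  # eq. (3): mean over the window
    return smoothed
```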
B. Image Segmentation
Image segmentation reduces the complexity of the image as it
returns only the desired part of the image. Moreover, it makes
sense to remove the redundant background from all the images so
the features of the background, which will be the same for
different classes, do not play a part while determining the
interaction class. In order to segment human beings from RGB
and depth images, two image segmentation techniques have been
used, as discussed in detail below.
1) RGB Image Segmentation
Humans have been segmented from the RGB images using the
edge detection technique. Edges or boundaries are detected based
on discontinuity in the intensity values of the pixels. For this
purpose, all RGB images are first converted into grayscale images
and a binary silhouette is extracted using the detected edges. A
floor detection and removal technique has also been implemented
for the NTU RGB+D dataset where the floor often gets
misclassified as the foreground. Based on the range of its intensity
values, a floor mask has been created, which is then used to remove
the floor. The original RGB pixel values are then restored in the
detected binary silhouettes to get the desired RGB silhouettes. Fig.
3 shows the results of the RGB image segmentation stage.
Fig. 3. RGB silhouette segmentation: (a) 'hugging' interaction (NTU RGB+D): original image (left), binary silhouette (center), and segmented image (right); (b) 'shaking hands' interaction (UoL 3D): original image (left), binary silhouette (center), and segmented image (right).
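The edge-based RGB segmentation could be sketched as below. The Canny thresholds, the morphological closing used to turn edges into a filled silhouette, and the floor intensity range are illustrative assumptions rather than values reported in the paper.

```python
import cv2
import numpy as np

def segment_rgb(rgb_image, floor_range=(90, 140)):
    """Edge-based human silhouette extraction with an intensity-based floor mask."""
    gray = cv2.cvtColor(rgb_image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)                 # detect intensity discontinuities
    kernel = np.ones((5, 5), np.uint8)
    closed = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)
    # Fill the detected contours to obtain a binary silhouette
    contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    silhouette = np.zeros_like(gray)
    cv2.drawContours(silhouette, contours, -1, 255, thickness=cv2.FILLED)
    # Remove floor pixels that fall inside an assumed intensity range
    floor_mask = cv2.inRange(gray, floor_range[0], floor_range[1])
    silhouette[floor_mask > 0] = 0
    # Restore the original RGB values inside the silhouette
    return cv2.bitwise_and(rgb_image, rgb_image, mask=silhouette)
```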
2) Depth Image Segmentation
For depth images, Otsu’s thresholding technique has been used
[93]. The intensity values of the depth images available in the NTU
RGB+D dataset have been adjusted as they were too dark to be
seen. Then the cropped and intensity-adjusted images have been
segmented using Otsu’s thresholding technique, which selects the threshold $t$ that minimizes the weighted intra-class variance given in eq. (4). Fig. 4 shows the results of the depth image segmentation stage.

$$\sigma_w^2(t) = \omega_0(t)\,\sigma_0^2(t) + \omega_1(t)\,\sigma_1^2(t) \qquad (4)$$

where $\sigma_w^2(t)$ is the weighted sum of the intra-class variances of the two classes (foreground and background), $\omega_0(t)$ and $\omega_1(t)$ are the probabilities of the two classes separated by the threshold, and $t$ is the threshold value.
Fig. 4. Depth silhouette segmentation: (a) 'kicking' interaction (NTU RGB+D): original intensity-adjusted image (left), binary silhouette (center), and segmented image (right); (b) 'help stand up' interaction (UoL 3D): original image (left), binary silhouette (center), and segmented image (right).
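A possible implementation of this depth segmentation step uses OpenCV's built-in Otsu thresholding. The gain and offset used for the intensity adjustment, and the assumption that the adjusted depth frames are 8-bit, are illustrative choices not stated in the paper.

```python
import cv2

def segment_depth(depth_image, alpha=1.5, beta=40):
    """Brighten a dark depth frame, then segment it with Otsu's threshold (eq. 4)."""
    # Intensity adjustment (the gain/offset values are illustrative assumptions)
    adjusted = cv2.convertScaleAbs(depth_image, alpha=alpha, beta=beta)
    # Otsu picks the threshold t that minimizes the weighted intra-class variance
    _, silhouette = cv2.threshold(adjusted, 0, 255,
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return cv2.bitwise_and(adjusted, adjusted, mask=silhouette)
```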
C. Feature Extraction via CNN
For extraction of features from images, a Convolutional Neural
Network (CNN) has been used. The transfer learning approach
has been employed, which includes using VGG16 as the base
model and then fine-tuning its weights according to the used
datasets. VGG16 is a CNN model that achieves 92.7% top-5 accuracy on the ImageNet dataset, which has 1000 classes. Fig. 5 shows all the
layers in the VGG16 model.
Fig. 5. Different layers of the VGG16 architecture with configurations.
First, the training and testing images have been passed through the VGG16 base model, producing feature maps with dimensions of 7x7x512. These feature maps have then been fed to the proposed CNN head, which has three convolutional layers with 128, 64, and 32 filters, respectively. The convolutional layers compute the outputs of neurons that are connected to local regions in the input. Convolution is equivalent to sliding a filter over an image and computing the dot product of the filter weights and the image pixels. The Rectified Linear Unit (ReLU) has been used as the activation function for all three convolutional layers; it simply sets all negative values to zero. Then a batch normalization layer followed by a flatten layer has been used. Lastly, a dropout layer with a rate of 0.2 has been used to avoid overfitting.
Fig. 6 shows the layers in the proposed model. Table I shows a
summary of the proposed CNN model.
Fig. 6. A general overview of the layers in our CNN model.
TABLE I. A BRIEF SUMMARY OF OUR CNN MODEL

Layer      | Output Shape       | Parameters
Conv:128   | (None, 7, 7, 128)  | 65664
Conv:64    | (None, 7, 7, 64)   | 8256
Conv:32    | (None, 7, 7, 32)   | 2080
BatchNorm  | (None, 7, 7, 32)   | 128
Flatten    | (None, 1568)       | 0
Dropout    | (None, 1568)       | 0
Softmax    | (None, 11)         | 17259
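Based on the layer summary in Table I, a Keras sketch of this feature-extraction pipeline might look as follows. The 1x1 kernel size is inferred from the reported parameter counts (e.g., 512x128 + 128 = 65,664), freezing the VGG16 base is a simplification of the fine-tuning described above, and the handling of the fused RGB-D input (VGG16 itself expects three channels) is an assumption not specified in the paper.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# VGG16 used as the base feature extractor (outputs 7x7x512 for a 224x224 input)
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # assumption: base kept frozen in this sketch

# Proposed head: three 1x1 convolutions (128, 64, 32 filters), batch normalization,
# flatten, dropout, and an 11-way softmax for the NTU RGB+D interaction classes.
model = models.Sequential([
    base,
    layers.Conv2D(128, (1, 1), activation="relu"),
    layers.Conv2D(64, (1, 1), activation="relu"),
    layers.Conv2D(32, (1, 1), activation="relu"),
    layers.BatchNormalization(),
    layers.Flatten(),
    layers.Dropout(0.2),
    layers.Dense(11, activation="softmax"),
])
model.summary()  # parameter counts match Table I (65664, 8256, 2080, 128, 17259)
```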
D. Human Interaction Recognition Using Softmax
After extracting features using CNN, the softmax classifier has
been used to recognize human interactions. The softmax function
is a popular choice for multiclass classification [94]. It converts the raw class scores into probabilities that sum to 1. The softmax output for each class is computed using eq. (5).

$$P_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}} \qquad (5)$$

where $P_i$ is the probability assigned to class $i$, $z_i$ is the score of class $i$, and $n$ is the total number of classes.
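For illustration, eq. (5) can be evaluated directly on a vector of class scores; the snippet below is a generic NumPy sketch, not code from the paper.

```python
import numpy as np

def softmax(scores):
    """Eq. (5): convert raw class scores into probabilities that sum to 1."""
    exp_scores = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return exp_scores / np.sum(exp_scores)

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())  # the probabilities sum to 1
```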
IV. EXPERIMENTAL SETUP AND RESULTS
This section gives a brief description of the datasets used for
experimentation, the implementation details, and the results of
different experiments conducted to evaluate the performance of the
proposed HIR model. The results also contain a comparison of the
proposed system’s accuracy with that of other state-of-the-art
systems.
A. Datasets
1) NTU RGB+D dataset
The NTU RGB+D dataset [95,96] consists of 60 classes, 11 of
which are two-person interactions: punching, kicking, pushing, pat
on back, point finger, hugging, giving object, touch pocket,
shaking hands, walking towards, and walking apart. There are 48
videos for each interaction class. Each session has three sets of
videos since each video is recorded from three different
viewpoints. Fig. 7 shows a few sample frames from this dataset.
Fig. 7. NTU RGB+D dataset. (a) RGB frame for 'giving object' (left), RGB frame for 'punching' (center), and RGB frame for 'pat on back' (right); (b) depth frame for 'giving object' (left), depth frame for 'punching' (center), and depth frame for 'pat on back' (right).
2) UoL 3D Social Interaction dataset
The UoL 3D social interaction dataset [97] provides RGB+D
videos and skeleton information of 8 interaction classes: shaking
hands, talk, help walk, help stand up, hug, push, fight, and draw
attention. This dataset includes ten sessions, each comprising two
long videos containing all eight interactions. The skeleton tracks
are provided in a text format. Information about 25 skeleton joints
is provided. Fig. 8 shows a few sample frames from this dataset.
Fig. 8. UoL 3D Social Interaction dataset. (a) RGB frame for 'hugging' (left), RGB frame for 'shaking hands' (center), and RGB frame for 'kicking' (right); (b) depth frame for 'hugging' (left), depth frame for 'shaking hands' (center), and depth frame for 'kicking' (right).
B. Implementation
The proposed CNN model has been developed in Python using
Jupyter Notebook. Python’s deep learning library, Keras, has been
used as it provides the VGG-16 model and different layers for
convolution, batch normalization, flattening, dropout, and
softmax. The proposed model has been trained for 30 epochs. Fig.
9 shows how the model’s accuracy increased and loss decreased
with the increase in the number of epochs.
Fig. 9. Accuracy and loss graphs: (a) model accuracy increased with increasing epochs; (b) model loss decreased with increasing epochs.
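Under the reported settings (Keras, 30 epochs, one-third of the data held out), the training step could be sketched as follows, reusing the model object from the Section III.C sketch. The optimizer, batch size, random seed, and placeholder data are assumptions not stated in the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the assumed input shape; in practice X holds the fused,
# pre-processed keyframe images and y the one-hot interaction labels.
X = np.random.rand(120, 224, 224, 3).astype("float32")
y = np.eye(11)[np.random.randint(0, 11, size=120)]

# One-third of the data held out for validation/testing, as reported in the paper
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=1/3, random_state=42)

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=30)
```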
C. Results
To evaluate the performance of the proposed system, a one-third validation split has been used, i.e., one-third of the data has been held out for testing while the rest has been used for training. Tables II and III show
the average accuracies achieved by the proposed model and the
accuracies achieved per interaction class over the NTU RGB+D
and UoL 3D Social Interaction datasets respectively.
TABLE II. RECOGNITION ACCURACIES OF CLASSES OF NTU RGB+D DATASET

Class            | Accuracy (%)
punching         | 82.23
kicking          | 94.72
pushing          | 82.02
pat on back      | 79.94
point finger     | 95.15
hugging          | 95.15
giving object    | 80.76
touch pocket     | 83.05
shaking hands    | 92.51
walking towards  | 89.24
walking apart    | 82.56
average accuracy | 87.03
TABLE III. RECOGNITION ACCURACIES OF CLASSES OF UOL 3D DATASET

Class            | Accuracy (%)
handshake        | 81.12
talk             | 88.45
help walk        | 85.32
help stand up    | 83.21
hug              | 90.02
push             | 87.30
fight            | 86.04
draw attention   | 88.21
average accuracy | 86.21
Tables IV and V show a comparison of the results of the
proposed system with two state-of-the-art (SOTA) classifiers:
Bayesian and Random Forest [98]. The proposed method also
outperforms some recent state-of-the-art methods as shown in
Tables VI and VII.
TABLE IV. COMPARISON OF THE PROPOSED METHOD WITH SOTA CLASSIFIERS OVER NTU RGB+D DATASET

Interaction Class | Random Forest (%) | Bayesian (%) | Proposed method (%)
punching          | 47.43 | 96.41 | 82.23
kicking           | 91.56 | 72.05 | 94.72
pushing           | 86.74 | 92.56 | 82.02
pat on back       | 79.48 | 84.36 | 79.94
point finger      | 91.56 | 79.49 | 95.15
hugging           | 87.74 | 74.62 | 95.15
giving object     | 73.07 | 71.54 | 80.76
touch pocket      | 72.35 | 76.67 | 83.05
shaking hands     | 87.17 | 83.07 | 92.51
walking towards   | 88.74 | 70.26 | 89.24
walking apart     | 50.24 | 70.26 | 82.56
average accuracy  | 77.82 | 79.21 | 87.03
TABLE V. COMPARISON OF THE PROPOSED METHOD WITH SOTA CLASSIFIERS OVER UOL 3D DATASET

Interaction Class | Random Forest (%) | Bayesian (%) | Proposed method (%)
handshake         | 52.43 | 84.41 | 81.12
talk              | 85.71 | 82.05 | 88.45
help walk         | 82.45 | 72.56 | 85.32
help stand up     | 79.48 | 74.36 | 83.21
hug               | 85.71 | 79.49 | 90.02
push              | 78.34 | 74.62 | 87.30
fight             | 77.07 | 81.54 | 86.04
draw attention    | 74.35 | 76.67 | 88.21
average accuracy  | 76.94 | 78.21 | 86.21
TABLE VI. COMPARISON OF THE PROPOSED METHOD WITH SOTA METHODS OVER NTU RGB+D DATASET

Authors               | Methods                         | Accuracy (%)
Songyang et al. [99]  | geometric features              | 70.26
Inwoong et al. [100]  | ensemble TS-LSTM v2             | 74.60
Amir et al. [101]     | deep multimodal features        | 74.9
Junwoo et al. [102]   | mobile robot platform           | 75.0
Mengyuan et al. [103] | enhanced skeleton visualization | 75.97
proposed model        |                                 | 87.03
TABLE VII. COMPARISON OF THE PROPOSED METHOD WITH SOTA METHODS OVER UOL 3D DATASET

Authors               | Methods                                    | Accuracy (%)
Claudio et al. [104]  | probabilistic merging of skeletal features | 85.1
Muhammad et al. [105] | multimodal feature level fusion            | 85.12
Claudio et al. [97]   | statistical and geometrical features       | 85.56
proposed model        |                                            | 86.21
V. CONCLUSION
In this paper, an HIR system has been proposed that efficiently
recognizes complex human-to-human interactions using both
RGB and depth information. The performed experiments have
shown that RGB-D images give better results than RGB images.
Furthermore, using only 10 keyframes instead of the entire videos reduces the model training time.
As future work, the researchers plan to explore new and better
ways of fusing RGB and depth images for a more efficient
classification system. It is also intended to train and evaluate the
proposed system on more challenging datasets.
REFERENCES
[1] O. Aran and D. Gatica-Perez, “One of a kind: Inferring personality
impressions in meetings,” in Proc. on ICMI (ACM), pp. 11-18, 2013.
[2] A. Jalal, S. Kamal, and D. Kim, “Depth map-based human activity tracking
and recognition using body joints features and self-organized map,” in Proc.
on CCNT, pp. 1-6, 2014.
[3] A. Jalal and Y. Kim, Dense depth maps-based human pose tracking and
recognition in dynamic scenes using ridge data,” in Proc. on Advanced
Video and Signal-based Surveillance, pp. 119-124, 2014.
[4] A. Jalal, S. Kamal, and D. Kim, “Shape and motion features approach for
activity tracking and recognition from Kinect video camera,” in Proc. on
Advanced Information Networking and Applications Workshops, pp. 445-
450, 2015.
[5] A. Jalal, N. Sharif, J.T. Kim, and T.S. Kim, “Human activity recognition via
recognized body parts of human depth silhouettes for residents monitoring
services at smart homes, in Indoor and Built Environment, vol. 22, pp. 271-
279, 2013.
[6] A. Jalal, M.A.K. Quaid, and M.A. Sidduqi, “A triaxial acceleration-based
human motion detection for ambient smart home system,” in Proc. on
Applied Sciences and Technology, 2019.
[7] A. Jalal, S. Lee, J. Kim, and T. Kim, “Human activity recognition via the
features of labeled depth body parts,” in Proc. on Smart Homes Health
Telematics, pp. 246-249, 2012.
[8] A. Jalal, J.T. Kim, and T.S Kim, “Development of a life logging system via
depth imaging-based human activity recognition for smart homes,” in Proc.
on Sustainable Healthy Buildings, pp. 91-95, 2012.
[9] A. Jalal, S. Kamal, and D. Kim, “A depth video sensor-based life-logging
human activity recognition system for elderly care in smart indoor
environments,” in Sensors, vol. 14, pp. 11735-11759, 2014.
[10] T. Kim, A. Jalal, H. Han, H. Jeon, and J. Kim, “Real-time life logging via
depth imaging-based human activity recognition towards smart homes
services,” in Proc. on Renewable Energy Sources and Healthy Buildings, pp.
63, 2013.
[11] G.H. Liu, J.Y. Yang, and Z. Li, “Content-based image retrieval using
computational visual attention model,” in Pattern Recognition, vol. 48, pp.
2554-2566, 2015.
[12] S. Sempena, N.U. Maulidevi, and P.R. Aryan, “Human action recognition
using dynamic time warping,” in Proc. on ICEEI, pp. 1-5, 2011.
[13] A. Jalal, S. Kamal, and D. Kim, “Facial expression recognition using 1D
transform features and hidden markov model,” in Journal of Electrical
Engineering & Technology, vol. 12, pp. 1657-1662, 2017.
[14] M. Mahmood, A. Jalal, and H. A. Evans, “Facial expression recognition in
image sequences using 1D transform and gabor wavelet transform,” in Proc.
on Applied and Engineering Mathematics, 2018.
[15] A. Jalal, M. Batool, and K. Kim, “Stochastic recognition of physical activity
and healthcare using tri-axial inertial wearable sensors,” in Applied
Sciences, 2020.
[16] A. Jalal, M.A.K. Quaid, S.B. Tahir, and K. Kim, “A study of accelerometer
and gyroscope measurements in physical life-log activities detection
systems,” in Sensors, 2020.
[17] A. Jalal, M. Batool, and K. Kim, “Sustainable wearable system: human
behavior modeling for life-logging activities using k-ary tree hashing
classifier,” in Sustainability, 2020.
[18] M. Javeed, A. Jalal, and K. Kim, “Wearable sensors based exertion
recognition using statistical features and random forest for physical
healthcare monitoring,” in Proc. on Applied Sciences and Technology, 2021.
[19] A. Jalal, M. Batool and B. Tahir, “Markerless sensors for physical health
monitoring system using ECG and GMM feature extraction,” in Proc. on
IBCAST, 2021.
[20] A. Jalal, M.A.K. Quaid, and A.S. Hasan, “Wearable sensor-based human
behavior understanding and recognition in daily life for smart
environments, in Proc. on Frontiers of Information Technology, 2018.
[21] A. Shehzad, A. Jalal, and K. Kim, “Multi-person tracking in smart
surveillance system for crowd counting and normal/abnormal events
detection, in Proc. on Applied and Engineering Mathematics, 2019.
[22] P. Mahwish, G. Yazeed, M. Gochoo, A. Jalal, S. Kamal, and D. Kim, “A
smart surveillance system for people counting and tracking using particle
flow and modified SOM,” in Sustainability, 2021.
[23] P. Mahwish, A. Jalal, and K. Kim, “Hybrid algorithm for multi people
counting and tracking for smart surveillance,” in Proc. on IBCAST, 2021.
[24] N. Khalid, M. Gochoo, A. Jalal, and K. Kim, “Modeling two-person
segmentation and locomotion for stereoscopic action identification: a
sustainable video surveillance system, in Sustainability, 2021.
[25] A. Jalal, Y. Kim, S. Kamal, A. Farooq, and D. Kim, “Human daily activity
recognition with joints plus body features representation using Kinect
sensor,” in Proc. on Informatics, Electronics, and Vision, 2015.
[26] A. Jalal, S. Kamal, A. Farooq, and D. Kim, “A spatiotemporal motion
variation features extraction approach for human tracking and pose-based
action recognition,” in Proc. on Informatics, Electronics, and Vision, 2015.
[27] A. Nadeem, A. Jalal, and K. Kim, “Human actions tracking and recognition
based on body parts detection via artificial neural network,” in Proc. on
Advancements in Computational Sciences, 2020.
[28] S. Kamal, A. Jalal, and D. Kim, “Depth images-based human detection,
tracking and activity recognition using spatiotemporal features and Modified
HMM, in Journal of Electrical Engineering and Technology, pp. 1921-1926,
2016.
[29] A. Jalal, M. Mahmood, and A. S. Hasan, “Multi-features descriptors for
human activity tracking and recognition in Indoor-outdoor environments,”
in Proc. on Applied Sciences and Technology, 2019.
[30] M. Asadi-Aghbolaghi, et al., “A survey on deep learning based approaches
for action and gesture recognition in image sequences, in Proc. on
Automatic Face & Gesture Recognition, 2017.
[31] A. Jalal, S. Kamal, and D. Kim, “Human depth sensors-based activity
recognition using spatiotemporal features and hidden markov model for
smart environments, in Journal of Computer Networks and
Communications, vol. 2016, pp. 1-11, 2016.
[32] A. Jalal, S. Kamal, and D. Kim, “Depth Silhouettes Context: A new robust
feature for human tracking and activity recognition based on embedded
HMMs,” in Proc. on Ubiquitous Robots and Ambient Intelligence, pp. 294-
299, 2015.
[33] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition, in Proc. on Learning Representations, 2015.
[34] A. Jalal, J.T. Kim, and T.S. Kim, “Human activity recognition using the
labeled depth body parts information of depth silhouettes,” in Proc. on
Sustainable Healthy Buildings, pp. 1-8, 2012.
[35] A. Jalal, M.Z. Uddin, and T.S. Kim, “Depth video-based human activity
recognition system using translation and scaling invariant features for life
logging at smart home,” in IEEE Transaction on Consumer Electronics, vol.
58, pp. 863-871, 2012.
[36] A. Jalal, M.Z. Uddin, J.T. Kim, and T.S. Kim, “Daily human activity
recognition using depth Silhouettes and R transformation for smart home,”
in Proc. on Smart Homes Health Telematics, pp. 25-32, 2011.
[37] S. Badar, A. Jalal, and M. Batool, “Wearable Sensors for activity analysis
using SMO-based random forest over smart home and sports datasets”, in
Proc. on ICACS, 2020.
[38] S. Badar, A. Jalal, and K. Kim, “Wearable inertial sensors for daily activity
analysis based on Adam optimization and the maximum entropy markov
model”, in Entropy, vol. 22, pp. 1-19, 2020.
[39] A. Stergiou and R. Poppe, “Understanding human-human interactions: a
survey,” in Computer Vision and Image Understanding, 2019.
[40] A. Jalal, Y. Kim, and D. Kim, Ridge body parts features for human pose
estimation and recognition from RGB-D video data,” in Proc. on Computing,
Communication and Networking Technologies, pp. 1-6, 2014.
[41] M. Mahmood, A. Jalal, and K. Kim, “WHITE STAG model: Wise human
interaction tracking and estimation (WHITE) using spatio-temporal and
angular-geometric (STAG) Descriptors”, in Multimedia Tools and
Applications, 2020.
[42] A. Farooq, A. Jalal, and S. Kamal, “Dense RGB-D map-based human
tracking and activity recognition using skin joints features and self-
organizing map,” in KSII Transactions on internet and information systems,
vol. 9, pp. 1856-1869, 2015.
[43] A. Ahmed, A. Jalal, and K. Kim, “RGB-D images for object segmentation,
localization and recognition in indoor scenes using feature descriptor and
Hough voting”, in Proc. on Applied Sciences and Technology, 2020.
[44] M. Gochoo, S.R. Amna, G. Yazeed, A. Jalal, S. Kamal, and D. Kim, “A
systematic deep learning based overhead tracking and counting system using
RGB-D remote cameras,” in Applied Sciences, 2021.
[45] M.A.K. Quaid and A. Jalal, “Wearable sensors based human behavioral
pattern recognition using statistical features and reweighted genetic
Algorithm,” in Multimedia Tools and Applications, 2019.
[46] A. Ahmed, A. Jalal, and K. Kim, “A novel statistical method for scene
classification based on multi-object categorization and logistic regression,
in Sensors, 2020.
[47] S. Li, W. Zhang, and A.B. Chan, “Maximum-margin structured learning with
deep networks for 3D human pose estimation”, in Proc. on ICCV, pp. 2848-2856, 2015.
[48] A. Jalal, S. Kamal, and D. Kim, “Individual Detection-Tracking-Recognition
using depth activity images,” in Proc. on Ubiquitous Robots and Ambient
Intelligence, pp. 450-455, 2015.
[49] S. Kamal and A. Jalal, “A hybrid feature extraction approach for human
detection, tracking and activity recognition using depth sensors, in Arabian
Journal for Science and Engineering, vol. 41, pp. 1043-1051, 2016.
[50] M. Batool, A. Jalal, and K. Kim, “Sensors technologies for human activity
analysis based on SVM optimized by PSO algorithm,” in Proc. on ICAEM,
2019.
[51] B. Tahir, A. Jalal, and K. Kim, IMU sensor based automatic-features
descriptor for healthcare patient’s daily life-log recognition,” in Proc. on
Applied Sciences and Technology, 2021.
[52] M. Gochoo, S. Badar, A. Jalal, and K. Kim, “Monitoring real-time personal
locomotion behaviors over smart indoor-outdoor environments via body-
worn sensors,” in IEEE Access, 2021.
[53] U. Azmat and A. Jalal, “Smartphone inertial sensors for human locomotion
activity recognition based on template matching and codebook generation,”
in Proc. on Communication Technologies, 2021.
[54] A. Jalal, M.A.K. Quaid, and K. Kim, “A Wrist worn acceleration based
human motion analysis and classification for ambient smart home System,”
in Journal of Electrical Engineering & Technology, 2019.
[55] M. Batool, A. Jalal, and K. Kim, “Telemonitoring of daily activity using
accelerometer and gyroscope in smart home environments, in Journal of
Electrical Engineering and Technology, 2020.
[56] A. Jalal, M.A.K. Quaid, S.B. Tahir, and K. Kim, “A study of accelerometer
and gyroscope measurements in physical life-log activities detection
systems, in Sensors, 2020.
[57] A. Jalal, Y.H. Kim, Y.J. Kim, S. Kamal, and D. Kim, “Robust human activity
recognition from depth video using spatiotemporal multi-fused features, in
Pattern recognition, vol. 61, pp. 295-308, 2017.
[58] A. Jalal and S. Kamal, “Improved behavior monitoring and classification
using cues parameters extraction from camera array images,” in International
Journal of Interactive multimedia and Artificial Intelligence, vol. 5, 2018.
[59] K. Kim, A. Jalal, and M. Mahmood, “Vision-based human activity
recognition system using depth silhouettes: A smart home system for
monitoring the residents, in Journal of Electrical Engineering and
Technology, 2019.
[60] A. Jalal, S. Kamal, and D. Kim, “A depth video-based human detection and
activity recognition using multi-features and embedded hidden Markov
models for health care monitoring systems, in International Journal of
Interactive multimedia and Artificial Intelligence, vol. 4, pp. 54-62, 2017.
[61] A. Ahmed, A. Jalal, and K. Kim, “Multi‑objects detection and segmentation
for scene understanding based on Texton forest and kernel sliding
perceptron, in Journal of Electrical Engineering and Technology, 2020.
[62] I. Akhter, A. Jalal, and K. Kim, “Pose estimation and detection for event
recognition using Sense-Aware features and Adaboost classifier”, in Proc.
on IBCAST, 2021.
[63] A.A. Rafique, A. Jalal, and A. Ahmed, “Scene understanding and
recognition: statistical segmented model using geometrical features and
Gaussian naïve bayes, in Proc. on Applied and Engineering Mathematics,
2019.
[64] A. Ahmed, A. Jalal, and K. Kim, Region and decision tree-based
segmentations for multi-objects detection and classification in outdoor
scenes, in Proc. on Frontiers of Information Technology, 2019.
[65] A.A. Rafique, A. Jalal, and K. Kim, “Statistical multi-objects segmentation
for indoor/outdoor scene detection and classification via depth images, in
Proc. on Applied Sciences and Technology, 2020.
[66] A. Jalal, M. Mahmood, and M.A. Sidduqi, “Robust spatio-temporal features
for human interaction recognition via artificial neural network, in Proc. on
Frontiers of information technology, 2018.
[67] A. Jalal, S. Kamal, and C. Cecer, “Depth maps-based human segmentation
and action recognition using full-body plus body color cues via recognizer
engine, in Journal of Electrical Engineering & Technology, 2018.
[68] A. Jalal and M. Mahmood, “Students’ behavior mining in e-learning
environment using cognitive processes with information technologies,” in
Education and Information Technologies Springer, 2019.
[69] A. Jalal and S. Kim, “Algorithmic implementation and efficiency
maintenance of real-time environment using low-bitrate wireless
communication,” in Proc. on Software Technologies for Future Embedded
and Ubiquitous Systems, 2006.
[70] S. Abbasi, S. Kamal, M. Gochoo, A. Jalal, and D. Kim, “Affinity-based task
scheduling on heterogeneous multicore systems using CBS and QBICTM,”
in Applied Sciences, 2021.
[71] K. Nida, G. Y. Yazeed, M. Gochoo, A. Jalal, and K. Kim, “Semantic
recognition of human-object interactions via Gaussian-based elliptical
modelling and pixel-level labeling,” in IEEE Access, 2021.
[72] H. Ansar, A. Jalal, M. Gochoo, and K. Kim “Hand gesture recognition based
on auto‐landmark localization and reweighted genetic algorithm for
healthcare muscle activities”, in Sustainability, 2021.
[73] A. Jalal, A. Nadeem, and S. Bobasu, “Human body parts estimation and
detection for physical sports movements,” in Proc. on Communication,
Computing, and Digital Systems, 2019.
[74] S. Amna, A. Jalal, and K. Kim, “An Accurate Facial expression detector
using multi-landmarks selection and local transform features,” in Proc. on
IEEE conference, 2020.
[75] A. Jalal, S. Kamal, and D.S. Kim, Detecting complex 3D human motions
with body model low-rank representation for real-time smart activity
monitoring system,” KSII Transactions on Internet and Information
Systems, vol. 12, pp. 1189-1204, 2018.
[76] N. Amir, A. Jalal, and K. Kim, “Automatic human posture estimation for
sport activity recognition with robust body parts detection and entropy
markov model, in Multimedia Tools and Applications, 2021.
[77] A. Rafique, A. Jalal, and K. Kim, “Automated sustainable multi-object
segmentation and recognition via modified sampling consensus and kernel
sliding perceptron, in Symmetry, 2020.
[78] I. Akhter, A. Jalal, and K. Kim, “Adaptive pose estimation for gait event
detection using contextaware model and hierarchical optimization,” Journal
of Electrical Engineering and Technology, 2021.
[79] S.A. Rizwan, A. Jalal, M. Gochoo, and K. Kim, “Robust active shape model
via hierarchical feature extraction with SFS-optimized convolution neural
network for invariant human age classification,” in Electronics, vol. 10,
2021.
[80] M. Javeed, M. Gochoo, A. Jalal, and K. Kim, “HF-SPHR: Hybrid features
for sustainable physical healthcare pattern recognition using deep belief
networks”, in Sustainability, 2021.
[81] A. Ahmed, A. Jalal, and A.A. Rafique, “Salient segmentation based object
detection and recognition using hybrid genetic transform”, in Proc. on
ICAEM conference, 2019.
[82] F. Farooq, A. Jalal, and L. Zheng, “Facial expression recognition using
hybrid features and self-organizing maps,” in Proc. on Multimedia and Expo,
2017.
[83] M. Gochoo, I. Akhter, A. Jalal, and K. Kim, “Stochastic remote sensing
event classification over adaptive posture estimation via multifused data and
deep belief network”, in Remote Sensing, 2021.
[84] A. Jalal, A. Ahmed, A. Rafique, and K. Kim “Scene Semantic recognition
based on modified Fuzzy c-mean and maximum entropy using object-to-
object relations,” in IEEE Access, vol. 9, 2021.
[85] A. Jalal, I. Akhtar, and K. Kim, “Human posture estimation and sustainable
events classification via pseudo-2D stick model and k-ary tree hashing, in
Sustainability, 2020.
[86] A. Jalal, N. Khalid, and K. Kim, “Automatic recognition of human
interaction via hybrid descriptors and maximum entropy markov model
using depth sensors,” in Entropy, 2020.
[87] S. J. Berlin and M. John, “Human interaction recognition through deep
learning network,” in Proc. on ICCST, 2016.
[88] P. Lubina and M. Rudzki, “Artificial neural networks in accelerometer-based
human activity recognition,” in proc. on MIXDES, 2015.
[89] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould, “Dynamic
image networks for action recognition,” in Proc. on CVPR, pp. 3034-3042,
2016.
[90] G. Gkioxari, R. Girshick, and J. Malik, “Contextual action recognition with
R* CNN,” in Proc. on CVPR, pp. 1080-1088, 2015.
[91] A. Farzana, S. Abirami, and M. Sirvani, “A frame for captioning the human
interactions,” in Proc. on Advanced Computing, 2019.
[92] H. Fatta and U. Fajar, “Captioning image using convolutional neural network
(CNN) and long-short term memory (LSTM),” in International Seminar on
Research of Information Technology and Intelligent Systems, 2019.
[93] N. Otsu, “A threshold selection method from gray-level histograms”, in IEEE Trans. Sys. Man. Cyber., vol. 9, pp. 62-66, 1979.
[94] K. Banerjee et al., “Exploring Alternatives to Softmax Function,” 2020.
[95] A. Shahroudy, J. Liu, T. Ng, and G. Wang, “NTU RGB+D: A large scale
dataset for 3D human activity analysis,” in Proc. on CVPR, 2016.
[96] J. Liu et al., “NTU RGB+D 120: A large-scale benchmark for 3D human
activity understanding”, in TPAMI, 2019.
[97] C. Coppola, S. Cosar, D.R. Faria, and N. Bellotto. “Automatic detection of
human interactions from RGB-D data for social activity classification, in
Proc. on RO-MAN, Lisbon, Portugal, 2017.
[98] L. Breiman, “Random forests”, in Machine Learning, vol. 45, pp. 5-32, 2001.
[99] S. Zhang, X. Liu, and J. Xiao. “On geometric features for skeleton-based
action recognition using multilayer LSTM networks,” in Proc. on WACV, pp. 148-157, 2017.
[100] I. Lee, D. Kim, S. Kang, and S. Lee, “Ensemble deep learning for skeleton-
based action recognition using temporal sliding LSTM networks,” in Proc.
on ICCV, 2017.
[101] A. Shahroudy, T. Ng, Y. Gong, and G. Wang, “Deep multimodal feature
analysis for action recognition in RGB+D videos,” in TPAMI, vol. 40, pp.
1045-1058, 2018.
[102] J. Lee and B. Ahn, “Real-time human action recognition with a low-cost
RGB camera and mobile robot platform,” in Sensors, vol. 20, pp. 2886,
2020.
[103] M. Liu, H. Liu, and C. Chen, “Enhanced skeleton visualization for view invariant human action recognition,” in Pattern Recognition, vol. 68, pp. 346-362, 2017.
[104] C. Coppola, D.R. Faria, U. Nunes, and N. Bellotto, “Social activity
recognition based on probabilistic merging of skeleton features with
proximity priors from RGB-D data, in Proc. of the IEEE/RSJ IROS, 2016.
[105] M. Ehatisham-Ul-Haq et al., “Robust human activity recognition using
multimodal feature-level fusion, in IEEE Access, vol. 7, pp. 60736-60751,
2019.
... Researchers have faced many challenges while working on the task of recognizing human actions. The very first one is that human actions may vary significantly in terms of pose, appearance, speed, and context [14][15][16]. This makes developing robust algorithms that can accurately recognize actions in different settings quite difficult. ...
Article
Full-text available
Human action recognition is critical because it allows machines to comprehend and interpret human behavior, which has several real-world applications such as video surveillance, robot-human collaboration, sports analysis, and entertainment. The enormous variety in human motion and appearance is one of the most challenging problems in human action recognition. Additionally, when drones are employed for video capture, the complexity of recognition gets enhanced manyfold. The challenges including, the dynamic background, motion blur, occlusions, video capture angle, and exposure issues gets introduced that need to be taken care of. In this article, we proposed a system that deal with the mentioned challenges in drone recorded red-green-blue (RGB) videos. System first splits the video into its constituent frames and then performs a focused smoothing operation on the frames utilizing a bilateral filter. As a result, the foreground objects in the image gets enhanced while the background gets blur. After that a segmentation operation is performed using a quick shift segmentation algorithm that separates out human silhouette from the original video frame. The human skeleton was extracted from the silhouette, and key-points on the skeleton were identified. Thirteen skeleton key-points were extracted, including the head, left wrist, right wrist, left elbow, right elbow, torso, abdomen, right thigh, left thigh, right knee, left knee, right ankle, and left ankle. Using these key-points, we extracted normalized positions, their angular and distance relationship with each other, and 3D point clouds. By implementing an expectation maximization algorithm based on Gaussian mixture model, we drew elliptical clusters over the pixels using the key-points as the central positions to represent the human silhouette. Landmarks were located on the boundaries of these ellipses and were tracked from the beginning until the end of activity. After optimizing the feature matrix using a naïve Bayes feature optimizer, the classification is performed using a deep convolutional neural network. For our experimentation and the validation of our system, three benchmark datasets were utilized i.e., the UAVGesture, the DroneAction, and the UAVHuman dataset. Our model achieved a respective action recognition accuracy of 0.95, 0.90, and 0.44 on the mentioned datasets.
... To improve the design of YOLOv4 and make it more suitable for training on a single GPU, it primarily uses a modified Path Aggregation Network (PANet). The primary function of PANet is to enhance instance segmentation by preserving spatial data, which aids accurate pixel localization for the prediction of the mask [115], [116]. The crucial characteristics that give them their high level of accuracy for mask prediction are Bottom-up Path Augmentation, Fully-Connected Fusion, ...
Article
Full-text available
Road congestion, air pollution, and accident rates have all increased as a result of rising traffic density and worldwide population growth. Over the past recent years, the overall number of automobiles has increased significantly around the world. Therefore, an automated traffic monitoring system is essential for intelligent transportation management and control systems. The conventional traffic surveillance systems are based on local platforms which include the use of induction loops or static cameras mounted on the roadsides, poles, or bridges. These platforms often rely on expensive hardware which makes their implementation costly and also, lack flexibility and portability which constrains their deployment in different situations or areas. Whereas, aerial images can sense the traffic scenes with appropriate resolution over a broader area using mobile platforms. Although, there are many improved traffic monitoring systems have been introduced but still there are some challenges that need to be addressed. In this research, we have developed an efficient system for autonomous traffic monitoring based on aerial images. Moreover, the proposed model also classifies the detected vehicle into multiple vehicle classes. The proposed system involves seven steps. In the first step, all the input aerial images are pre-processed for noise removal and brightness adjustment using the defogging and gamma correction techniques respectively. Then, to separate the foreground and background, we used the segmentation technique. Next, we used the You Look Only Once (YOLO) algorithm for vehicle detection. To estimate the traffic density, we implemented a vehicle counter on each image frame. For vehicle classification, we implemented a Deep Belief Network classifier trained on Scale Invariant Feature Transform (SIFT) features. In the last stage, we used the DeepSORT tracker to track the vehicles across the extracted frames. An approximation of path trajectories followed by tracked vehicles is also performed. We used three publicly available datasets for experimentation. Different experiments have been conducted which shows the effectiveness of our proposed methodology.
... There are many multi-modality-based systems proposed by researchers in recent years [38][39][40][41][42]. In [43], a new deep learning and multi-modal data-based method has been suggested. ...
Conference Paper
Full-text available
Human activities have always been complex and most important concern for researchers especially when it comes to physical exercises. Multiple methods have been proposed for physical exercise recognition using different sensors where the conventional approaches focused on either videos or motion-based sensors. Whereas, the combination of both types of data can improve the physical exercise recognition particularly for complex motion patterns. For that reason, a hybrid hand-crafted cues-based method has been proposed in this paper. Data has been collected from the multi-modality-based datasets that are publicly available. Next, three different filters have been used to sift the noise from multiple sensors-based data. Then, an overlapping windowing technique along with human silhouette extraction has been utilized to pre-process the filtered data. Further, the hybrid hand-crafted cues have been extracted using linear prediction cepstral coefficients, Gaussian markov random field, and saliency maps. Finally, the cues have been reduced using multi-layer sequential forward selection methodology and the physical exercise activities have been classified using a deep belief network.
... However, due to the utilization of CNN-based features, the system was not able to achieve high accuracy rates. In [45], the authors have proposed an RGB and depth video frames-based IoT approach to detect human interactions. First, the key-frames have been extracted and images normalized followed by noise removal and region of interest extraction. ...
Conference Paper
Full-text available
Internet of things (IoT) represent the small devices connected together wirelessly collecting data to make lifestyle convenient. Inertial measurement units (IMU) and cameras connected together to collect data from multiple indoor activities can also support home surveillance systems. The traditional closed-circuit television is out-fashioned due to the huge volume of storage requirements and not connected together to notify users immediately of apprehensive activities. Therefore, this paper proposes an IoT-based surveillance system for indoor environments that will upkeep the security methods inside the home. For this purpose, the fused multi-sensors-based data is acquired from two state-of-the-art datasets, namely, Opportunity++ and CMU-MMAC. This acquired data from IoT devices is further pre-processed through multiple filtering techniques according to the type of data. Then, a skeleton model has been designed for the filtered video frame sequences. Furthermore, a bag of visual and motion features has been extracted using three different techniques followed by their discrimination. Finally, the IoT-based surveillance system detects indoor activities and provides feedback to the user.
... Secondly, estimated depth from RGB images using [9][10][11]. This method has been based on a neural network that is trained using images with its respective depth feature [12][13][14][15]. In third step, used the depth results with image processing filters to get the surface normal [16][17][18][19]. ...
Conference Paper
Full-text available
Creation of 3D models from a single RGB image is challenging problem in image processing these days, as the technology is in its early development stage. However, the demands for 3D technology and 3D reconstruction have been rapidly increasing nowadays. The traditional approach of computer graphics is to create a geometric model in 3D and try to reproduce it onto a 2D image with rendering. The major aim of the study is to create 3D models from 2D RGB image using machine learning techniques to be less computationally complex as compared to any deep learning algorithm. The proposed model has been based on three different modules such as: 2.5D features extraction, mesh generation, and 3D boundary detection. The ShapeNet dataset has been used for comparison. The testing results has shown an accuracy of 90.77 % in the plane class, 85.72% in the chair class, and 72.14% in the automobile class. The proposed model could be applicable to problems where reconstruction of 3D models is required such as: variations in geometric scale, mix of textured, uniformly colored, and reflective surfaces.
... The road mask is a binary image to locate the roads. The images are masked by multiplying the original image with the mask image pixel by pixel [34][35][36][37]. As the value of a black pixel is 0 when it gets multiplied with any other pixel value the resultant will be a 0. Thus, eliminating the irrelevant area. ...
Thesis
Full-text available
With the passage of time, human-computer interaction evolves. The use of traditional remote systems is replaced with hand gestures. Using hand gestures is an excellent way to communicate with others and operate different devices. This is done by training the system using different hand gestures. Recently, many datasets are available for hand gestures recognition model training used for various purposes. In our proposed model, we have trained our system using both machine and deep learning classifiers. We have adapted various pre-processing, hand detection, and feature descriptors methods for the efficient tracking and recognition of hand gestures. Our proposed work is focused on three applications i-e, hand gestures recognition for controlling smart home appliances, for e-learning and for medical specialists to communicate with the patients and to operate different electro-medical devices. We have used two datasets for each field and achieved remarkable accuracy rates against all datasets.
Article
Full-text available
The world’s expanding populace, the variety of human social factors, and the densely populated environment make humans feel uncertain. Individuals need a safety officer who generally deals with security viewpoints for this frailty. Currently, human monitoring techniques are time-consuming, work concentrated, and incapable. Therefore, autonomous surveillance frameworks are necessary for the modern day since they are able to address these problems. Nevertheless, hardships persist. The central concerns incorporate the detachment of the foreground from the scene and the understanding of the contextual structure of the environment for efficiently identifying unusual objects. In our work, we introduced a novel framework to tackle these difficulties by presenting a semantic segmentation technique for separating a foreground object. In our work, Super-pixels are generated using an improved watershed transform and then a conditional random field is implemented to obtain multi-object segmented frames by performing pixel-level labeling. Next, the Social Force model is introduced to extract the contextual structure of the environment via the fusion of a novel chosen particular histogram of an optical stream and inner force model. After using the computed social force, multi-people tracking is performed via three-dimensional template association using percentile rank and non-maximal suppression. Next, multi-object categorization is performed via deep learning Feature Pyramid Network. Finally, by considering the contextual structure of the environment, Jaccard similarity is utilized to make the decision for abnormality detection and identify the unusual objects from the scene. The invented framework is verified through rigorous investigations, and it obtained multi-people tracking efficiency of 92.2% and 89.1% over the UCSD and CUHK Avenue datasets. However, 95.2% and 93.7% abnormality detection efficiency is accomplished over UCSD and CUHK Avenue datasets, respectively.
Article
Full-text available
Over the past few years, automatic recognition of human interactions has drawn significant attention from researchers working in the field of Artificial Intelligence (AI). And feature extraction is one of the most critical tasks in developing efficient Human Interaction Recognition (HIR) systems. Moreover, recent researches in computer vision suggest that robust features lead to higher recognition accuracies. Hence, an improved HIR system has been proposed in this paper that combines 2D and 3D features extracted using machine learning and deep learning techniques. These discriminative features result in accurate classification and help avoid misclassification of similar interactions. Ten keyframes have been extracted from each video to reduce computational complexity. Next, these frames have been preprocessed using image normalization and noise removal techniques. The Region Of Interest (ROI), which contains the two humans involved in the interaction, has been extracted using motion detection. Then, the human silhouettes have been segmented using the GrabCut algorithm. Next, the extracted silhouettes have been converted into 3D meshes and their heat kernel signatures (HKS) have been obtained to extract key body points. A Convolutional Neural Network (CNN) has been used to extract full-body features from 2D full-body silhouettes. Then, topological and geometric features have been extracted from the key body points. Finally, the combined feature vector has been fed into Long Short-Term Memory (LSTM) and each interaction has been recognized using a Softmax classifier. The proposed system has been validated via extensive experimentation on three challenging RGB+D datasets. The recognition accuracies of 91.63%, 90.54%, and 90.13% have been achieved with the SBU Kinect Interaction, NTU RGB+D, and ISR-UoL 3D social activity datasets respectively. The results of extensive experiments performed on the proposed system suggest that it can be used effectively for various applications, such as security, surveillance, health monitoring, and assisted living.
Article
Full-text available
Human-Object Interaction (HOI) recognition, due to its significance in many computer vision-based applications, requires in-depth and meaningful details from image sequences. Incorporating semantics in scene understanding has led to a deep understanding of human-centric actions. Therefore, in this research work, we propose a semantic HOI recognition system based on multi-vision sensors. In the proposed system, the de-noised RGB and depth images, via Bilateral Filtering (BLF), are segmented into multiple clusters using a Simple Linear Iterative Clustering (SLIC) algorithm. The skeleton is then extracted from segmented RGB and depth images via Euclidean Distance Transform (EDT). Human joints, extracted from the skeleton, provide the annotations for accurate pixel-level labeling. An elliptical human model is then generated via a Gaussian Mixture Model (GMM). A Conditional Random Field (CRF) model is trained to allocate a specific label to each pixel of different human body parts and an interaction object. Two semantic feature types that are extracted from each labeled body part of the human and labelled objects are: Fiducial points and 3D point cloud. Features descriptors are quantized using Fisher’s Linear Discriminant Analysis (FLDA) and classified using K-ary Tree Hashing (KATH). In experimentation phase the recognition accuracy achieved with the Sports dataset is 92.88%, with the Sun Yat-Sen University (SYSU) 3D HOI dataset is 93.5% and with the Nanyang Technological University (NTU) RGB+D dataset it is 94.16%. The proposed system is validated via extensive experimentation and should be applicable to many computer-vision based applications such as healthcare monitoring, security systems and assisted living etc.
Article · Full-text available
This work presents the grouping of dependent tasks into clusters using a Bayesian analysis model to solve the affinity scheduling problem in heterogeneous multicore systems. Non-affinity scheduling of tasks has a negative impact because it increases the overall execution time of the tasks. Furthermore, non-affinity-based scheduling also limits the potential for data reuse in the caches, so the same data must be brought into the caches multiple times. In heterogeneous multicore systems, it is essential to address the load balancing problem because all cores operate at varying frequencies. We propose two techniques to solve the load balancing issue: a “chunk-based scheduler” (CBS), which is applied to heterogeneous systems, and “quantum-based intra-core task migration” (QBICTM), in which each task is given a fair and equal chance to run on the fastest core. Results show a 30–55% improvement in the average execution time of the tasks when our CBS or QBICTM scheduler is applied, compared to other traditional schedulers under the same operating system.
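A toy, frequency-aware chunk assignment in the spirit of the chunk-based scheduling idea is sketched below; it simply hands each chunk of tasks to the core expected to finish it earliest. The task costs, core frequencies, and chunk size are all illustrative, and this is not the paper's CBS algorithm.

```python
def chunk_schedule(task_costs, core_freqs, chunk_size=4):
    """Assign fixed-size chunks of tasks to the core that finishes each chunk earliest."""
    finish = [0.0] * len(core_freqs)                      # projected finish time per core
    plan = {core: [] for core in range(len(core_freqs))}
    chunks = [task_costs[i:i + chunk_size]
              for i in range(0, len(task_costs), chunk_size)]
    for chunk in chunks:
        work = sum(chunk)
        # A faster core (higher frequency) completes the same work in less time.
        core = min(range(len(core_freqs)),
                   key=lambda c: finish[c] + work / core_freqs[c])
        finish[core] += work / core_freqs[core]
        plan[core].append(chunk)
    return plan, finish

plan, finish = chunk_schedule(task_costs=[3, 5, 2, 7, 4, 6, 1, 8, 2, 3],
                              core_freqs=[1.0, 1.6, 2.4])  # heterogeneous core speeds
print(plan, finish)
```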
Article · Full-text available
Featured Application: The proposed technique is an application for people detection and counting which has been evaluated on several challenging benchmark datasets. The technique can be applied in heavy-crowd assistance systems that help to find targeted persons, track functional movements, and maximize the performance of surveillance security. Abstract: Automatic head tracking and counting using depth imagery has various practical applications in security, logistics, queue management, space utilization, and visitor counting. However, no currently available system can clearly distinguish between a human head and other objects in order to track and count people accurately. For this reason, we propose a novel system that can track people by monitoring their heads and shoulders in complex environments and also count the number of people entering and exiting the scene. Our system is split into six phases. First, preprocessing is done by converting videos of a scene into frames and removing the background from the video frames. Second, heads are detected using the Hough Circular Gradient Transform, and shoulders are detected by HOG-based symmetry methods. Third, three robust features, namely fused joint HOG-LBP, energy-based point clouds, and fused intra-inter trajectories, are extracted. Fourth, the Apriori association is implemented to select the best features. Fifth, deep learning is used for accurate people tracking. Finally, heads are counted using cross-line judgment. The system was tested on three benchmark datasets, the PCDS dataset, the MICC people counting dataset, and the GOTPD dataset, and counting accuracies of 98.40%, 98%, and 99%, respectively, were achieved. Our system obtained remarkable results.
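The head-detection phase can be approximated with OpenCV's circular Hough transform, as in the hedged sketch below. The depth frame file name and all thresholds are assumptions rather than the paper's configuration.

```python
import cv2
import numpy as np

depth = cv2.imread("depth_frame.png", cv2.IMREAD_GRAYSCALE)  # hypothetical depth frame
depth = cv2.medianBlur(depth, 5)                              # suppress sensor noise
# Circular gradient transform: detect roughly head-sized circles in the frame.
circles = cv2.HoughCircles(depth, cv2.HOUGH_GRADIENT, dp=1.2, minDist=40,
                           param1=80, param2=30, minRadius=8, maxRadius=40)
if circles is not None:
    for x, y, r in np.round(circles[0]).astype(int):
        print(f"head candidate at ({x}, {y}), radius {r}")
```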
Article · Full-text available
Given the rapid increase in demand for people counting and tracking systems for surveillance applications, there is a critical need for more accurate, efficient, and reliable systems. The main goal of this study was to develop an accurate, sustainable, and efficient system capable of error-free counting and tracking in public places, and the major objective was for the system to perform well across different orientations, densities, and backgrounds. We propose an accurate and novel approach consisting of preprocessing, object detection, people verification, particle flow, feature extraction, self-organizing map (SOM)-based clustering, people counting, and people tracking. Initially, filters are applied to preprocess the images and detect objects. Next, random particles are distributed and features are extracted. Subsequently, particle flows are clustered using a self-organizing map, and people counting and tracking are performed based on motion trajectories. Experimental results on the PETS-2009 dataset reveal an accuracy of 86.9% for people counting and 87.5% for people tracking, while experimental results on the TUD-Pedestrian dataset yield 94.2% accuracy for people counting and 94.5% for people tracking. The proposed system is a useful tool for medium-density crowds and can play a vital role in people counting and tracking applications.
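The SOM-based clustering step could look roughly like the sketch below, which uses the third-party minisom package on synthetic particle-flow vectors. The feature layout (dx, dy, speed, angle), map size, and training settings are assumptions, not the authors' setup.

```python
import numpy as np
from minisom import MiniSom                   # pip install minisom

rng = np.random.default_rng(0)
flows = rng.normal(size=(500, 4))             # stand-in (dx, dy, speed, angle) per particle
som = MiniSom(3, 3, input_len=4, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(flows, num_iteration=1000)
# Each particle is assigned to the SOM node that wins for its flow vector.
clusters = [som.winner(f) for f in flows]
print("distinct motion clusters:", len(set(clusters)))
```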
Article · Full-text available
The monitoring of human physical activities using wearable sensors, such as inertial sensors, plays a significant role in various current and potential applications. These applications include physical health tracking, surveillance systems, and robotic assistive technologies. Despite this wide range of applications, the classification and recognition of human activities remain imprecise, which may contribute to unfavorable reactions and responses. To improve the recognition of human activities, we designed a dataset in which ten participants (five male and five female) performed 11 different activities while wearing three body-worn inertial sensors at different locations on the body. Our model extracts features via a hierarchical technique spanning the time, wavelet, and time-frequency domains. Stochastic gradient descent (SGD) is then introduced to optimize the selected features. The selected features with optimized patterns are further processed by a multi-layered kernel sliding perceptron to develop adaptive learning for the classification of physical human activities. Our proposed model was experimentally evaluated on three benchmark datasets: IM-WSHA, a self-annotated dataset; PAMAP2, a dataset comprising daily living activities; and HuGaDB, a dataset containing physical activities of elderly people. The experimental results show that the proposed method achieves better results and outperforms others in terms of recognition accuracy, achieving accuracy rates of 83.18%, 94.16%, and 92.50% on the IM-WSHA, PAMAP2, and HuGaDB datasets, respectively.
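The hierarchical feature extraction described here can be illustrated with simple time-domain statistics plus wavelet sub-band energies for a single inertial channel, as sketched below using NumPy and PyWavelets. The window length, wavelet choice, and synthetic signal are assumptions.

```python
import numpy as np
import pywt                                    # PyWavelets

def window_features(signal):
    """Time-domain statistics plus wavelet sub-band energies for one sensor channel."""
    time_feats = [signal.mean(), signal.std(), signal.min(), signal.max(),
                  float(np.sqrt(np.mean(signal ** 2)))]       # RMS
    coeffs = pywt.wavedec(signal, "db4", level=3)             # 3-level wavelet decomposition
    wavelet_feats = [float(np.sum(c ** 2)) for c in coeffs]   # energy per sub-band
    return np.array(time_feats + wavelet_feats)

rng = np.random.default_rng(1)
accel_x = rng.normal(size=128)                # hypothetical 128-sample accelerometer window
print(window_features(accel_x))
```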
Article · Full-text available
To understand daily events accurately, adaptive pose estimation (APE) systems require a robust context-aware model and optimal feature selection methods. In this paper, we propose a novel gait event detection (GED) system that consists of saliency silhouette detection, a robust body parts model, and a 2D stick model, followed by a hierarchical optimization algorithm. Furthermore, the most prominent context-aware features, such as energy, 0–180° intensity, and distinct moveable features, are proposed by focusing on the invariant and localized characteristics of human postures in different event classes. Finally, we apply Grey Wolf optimization and a genetic algorithm to discriminate complex postures and to assign appropriate labels to each event. In order to evaluate the performance of the proposed GED, two public benchmark datasets, UCF101 and YouTube, are examined via n-fold cross-validation. For the two benchmark datasets, our proposed method detects the human body key points with 82.4% and 83.2% accuracy, respectively. It also extracts the context-aware features and finally recognizes the gait events with 82.6% and 85.0% accuracy, respectively. Compared with other well-known statistical and state-of-the-art methods, the proposed method achieves superior posture detection and recognition accuracy.
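To make the metaheuristic search step concrete, the sketch below implements a compact, generic Grey Wolf Optimization loop in NumPy and tests it on a sphere objective; the objective, bounds, and hyperparameters are placeholders and do not reflect the paper's GED formulation.

```python
import numpy as np

def gwo(objective, dim=5, wolves=20, iters=100, bounds=(-5.0, 5.0), seed=0):
    """Generic Grey Wolf Optimization: the alpha, beta, and delta wolves guide the pack."""
    rng = np.random.default_rng(seed)
    low, high = bounds
    X = rng.uniform(low, high, size=(wolves, dim))
    for t in range(iters):
        fitness = np.apply_along_axis(objective, 1, X)
        alpha, beta, delta = X[np.argsort(fitness)[:3]]        # three best wolves lead
        a = 2.0 - 2.0 * t / iters                              # decreases linearly from 2 to 0
        moves = []
        for leader in (alpha, beta, delta):
            r1, r2 = rng.random(X.shape), rng.random(X.shape)
            A, C = 2 * a * r1 - a, 2 * r2
            moves.append(leader - A * np.abs(C * leader - X))  # encircle the leader
        # Each wolf moves to the average of the three leader-suggested positions.
        X = np.clip(sum(moves) / 3.0, low, high)
    fitness = np.apply_along_axis(objective, 1, X)
    return X[np.argmin(fitness)], float(fitness.min())

best, score = gwo(lambda x: float(np.sum(x ** 2)))             # sphere test objective
print(best, score)
```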