Figure 5
Types of gestures. We created a classification of gesture types according to purpose defined by three complementary axes: communication, expression and action. We selected 85 gesture vocabularies, including Italian gestures, Indian Mudras, Sign language for the deaf, diving signals, pantomimes, and body language.

Source publication
Article
Full-text available
This paper surveys the state of the art on multimodal gesture recognition and introduces the JMLR special topic on gesture recognition 2011-2015. We began right at the start of the Kinect™ revolution when inexpensive infrared cameras providing image depth recordings became available. We published papers using this technology and other more conventi...

Context in source publication

Context 1
... instance a gesture vocabulary may consist of the signs to referee volleyball games or the signs to represent small animals in the sign language for the deaf. We selected lexicons from nine categories corresponding to various settings or application domains ( Figure 5): ...

Similar publications

Preprint
Full-text available
p>Nearly 15% of the global population is affected by disabilities impacting mobility. Monitoring foot pressure distribution during gait is a fundamental aspect of evaluating rehabilitation. Wearable systems provide a portable alternative to stationary equipment monitoring gait without laboratory space limitations. However, wearable sensors in some...
Chapter
Full-text available
With the rapid development of semiconductor technology and mobile communication networks, inertial sensors have higher performance, and correspondingly, the demand for human behavior recognition is increasing. At this stage, the mainstream research methods mainly analyze images in video streams. Although they can achieve good recognition results, d...
Article
Full-text available
Night shift workers are often associated with circadian misalignment and physical discomfort, which may lead to burnout and decreased work performance. Moreover, the irregular work hours can lead to significant negative health outcomes such as poor eating habits, smoking, and being sedentary more often. This paper uses commercial wearable sensors t...
Chapter
Full-text available
Home Intelligent Assistants (HIAs) typically integrate several types of healthcare and well-being solutions that include mobile applications, sensors or wearables. These solutions are usually connected to the HIA through the Internet designing an Internet of Things (IoT) ecosystem. This paper presents an IoT healthcare ecosystem for smart home envi...
Article
Full-text available
Lumbar loading causes increased intervertebral pressure and is an important factor in low back pain. However, it is difficult to quantitatively judge the actions that affect lumbar load and the magnitude of lumbar load increase. Low back pain occurs not only in the workplace but also during activities of daily living. Therefore, it is necessary to...

Citations

... This is not a new research problem in computer vision and automation. However, it still contains many challenges such as input image quality, lighting, hand occlusion, and speed of hand gestures [8]. Despite impressive results on hand gesture recognition, CNNs require a very large amount of training data for model building. ...
Article
Full-text available
Hand gesture recognition has important applications in human-computer interaction (HCI), human-robot interaction (HRI), and support for the deaf and mute. To achieve high accuracy, a hand gesture recognition model based on deep learning (DL) needs to be trained on large amounts of data covering many different conditions and contexts. In this paper, we publish the TQU-HG dataset, a large set of RGB images at low resolution (640×480 pixels), captured in low-light conditions and at high speed (16 fps). The TQU-HG dataset includes 60,000 images collected from 20 people (10 male, 10 female) performing 15 gestures with both left and right hands. We present a comparative study with two branches: i) based on MediaPipe TML and ii) based on convolutional neural networks (CNNs), including you only look once (YOLO) variants (YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLO-Nas), single shot multibox detector (SSD) with VGG16, residual networks (ResNet18, ResNeXt50, ResNet152), and MobileNet V3 small and large; the architecture and operation of the CNN models are also introduced in detail. We fine-tune the models and evaluate them on the TQU-HG and HaGRID datasets. Quantitative training and testing results are presented (the F1-scores of YOLOv8, YOLO-Nas, MobileNet V3 small, and ResNet50 are 98.99%, 98.98%, 99.27%, and 99.36%, respectively, on the TQU-HG dataset, and 99.21%, 99.37%, 99.36%, 86.4%, and 98.3% on the HaGRID dataset). The computation time of YOLOv8 is 6.19 fps on the CPU and 18.28 fps on the GPU.
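As a rough illustration of the kind of comparison reported above, the sketch below computes macro-averaged F1-scores for two classifiers from plain label lists; the model names, labels and values are hypothetical and are not taken from the TQU-HG or HaGRID experiments.

```python
# Minimal sketch: comparing gesture classifiers by macro-averaged F1-score.
# y_true and the per-model predictions below are invented, illustrative labels.
from sklearn.metrics import f1_score

y_true = ["fist", "palm", "ok", "palm", "fist"]            # ground-truth gesture labels
predictions = {
    "yolov8":    ["fist", "palm", "ok", "palm", "ok"],
    "mobilenet": ["fist", "palm", "ok", "fist", "fist"],
}

for name, y_pred in predictions.items():
    # macro averaging treats every gesture class equally, as in the reported comparison
    print(name, round(f1_score(y_true, y_pred, average="macro"), 4))
```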
... These mechanisms rely on generic hand tracking algorithms or recognition engines of a minimal set of hand gestures (Yang et al. 2015). For the mainstream, depth cameras became widely available with the Kinect revolution (Escalera et al. 2016). This was one of the anchor points that increased the public adoption of gesture recognition (Escalera et al. 2016). ...
Conference Paper
This paper explores the potential of gesture interaction as an alternative control concept for doors, windows and sliding systems in smart homes. In a first step, a technical prototype was built that allows door and window elements to be opened and closed with a hand-swiping gesture. In a second step, a user study with N = 95 participants was conducted to explore the perceived usefulness of the developed solution using a 24-item questionnaire. The results showed that 78 percent of the participants liked the concept of contactless gesture control of doors, windows and sliding systems. The reluctance of the remaining group could be traced back to a lack of experience with smart control concepts (e.g., voice assistants) (t-test: Spearman's rs = .27, p = .044) and the belief that gestures are hard to remember (chi-square test: α < .01, p = .007). The study also confirmed that the implemented control concept and gestures were perceived as natural and intuitively understandable.
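The statistics cited above (a Spearman rank correlation and a chi-square test) can be computed along the following lines; the survey data below are purely hypothetical and do not reproduce the study's responses.

```python
# Illustrative sketch (hypothetical data): the kinds of tests reported in the study,
# i.e. a Spearman rank correlation and a chi-square test on questionnaire responses.
import numpy as np
from scipy.stats import spearmanr, chi2_contingency

# Hypothetical ratings: prior experience with smart controls vs. acceptance of gesture control
experience = np.array([1, 2, 5, 4, 3, 1, 5, 2, 4, 3])
acceptance = np.array([2, 3, 5, 5, 3, 1, 4, 2, 5, 4])
rho, p_value = spearmanr(experience, acceptance)
print(f"Spearman rs = {rho:.2f}, p = {p_value:.3f}")

# Hypothetical contingency table: "gestures hard to remember" (yes/no) x acceptance (like/dislike)
table = np.array([[12, 30],
                  [25, 28]])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```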
... The metric used to evaluate the performance of the model was the Jaccard index or intersection over union value [37,38]. The index is often used for segmentation classifiers and was computed to compare a set of predicted labels with the set of corresponding true labels. ...
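For reference, the Jaccard index mentioned in this excerpt is the size of the intersection divided by the size of the union of the predicted and true label sets; the sketch below is a minimal illustration with invented labels, not the cited paper's evaluation code.

```python
# Minimal sketch of the Jaccard index (intersection over union) between the set of
# predicted labels and the set of true labels, as described in the citing text.
def jaccard_index(predicted, true):
    predicted, true = set(predicted), set(true)
    if not predicted and not true:
        return 1.0
    return len(predicted & true) / len(predicted | true)

# Example with hypothetical gesture labels for one sequence
print(jaccard_index({"swipe_left", "swipe_right"}, {"swipe_left", "point"}))  # -> 0.333...
```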
Article
Full-text available
This paper introduces a multi-class hand gesture recognition model developed to identify a set of hand gesture sequences from two-dimensional RGB video recordings, using both the appearance and spatiotemporal parameters of consecutive frames. The classifier utilizes a convolution-based network combined with a long short-term memory unit. To reduce the need for a large-scale dataset, the model is first trained on a public dataset and then fine-tuned on the hand gestures of relevance using transfer learning. Validation curves obtained with a batch size of 64 indicate an accuracy of 93.95% (±0.37) with a mean Jaccard index of 0.812 (±0.105) for 22 participants. The fine-tuned architecture illustrates the possibility of refining a model with a small set of data (113,410 fully labelled image frames) to cover previously unknown hand gestures. The main contribution of this work is a custom hand gesture recognition network driven by monocular RGB video sequences that outperforms previous temporal segmentation models while embracing a small-sized architecture that facilitates wide adoption.
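The abstract above describes a convolutional backbone whose per-frame features are aggregated by an LSTM. A minimal PyTorch sketch of that general pattern follows; the layer sizes, input resolution and class count are illustrative assumptions, not the authors' architecture.

```python
# Minimal PyTorch sketch of a frame-wise CNN followed by an LSTM over time,
# in the spirit of the CNN+LSTM classifier described above. Layer sizes, the
# number of gesture classes and the input shape are illustrative, not the paper's.
import torch
import torch.nn as nn

class CnnLstmGestureNet(nn.Module):
    def __init__(self, num_classes=22, feat_dim=128, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(                       # per-frame feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):                           # clips: (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)   # (batch, time, feat_dim)
        _, (h_n, _) = self.lstm(feats)                  # last hidden state summarizes the clip
        return self.head(h_n[-1])                       # (batch, num_classes)

logits = CnnLstmGestureNet()(torch.randn(2, 16, 3, 112, 112))
print(logits.shape)   # torch.Size([2, 22])
```

In practice the small per-frame CNN would be swapped for a pretrained backbone and fine-tuned, mirroring the transfer-learning step the abstract describes.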
... As gestures become more diversified and enriched, our instinctive intelligence will recognize basic actions and associate them with generic behaviors. The challenge of action recognition is mainly related to the difficulty of extracting body silhouettes from foreground rigid objects to focus on their emotions [96]. Occlusions that occur between different object parts can lead to a significant decrease in performance. ...
Article
Full-text available
The research progress in multimodal learning has grown rapidly over the last decade in several areas, especially in computer vision. The growing potential of multimodal data streams and deep learning algorithms has contributed to the increasing universality of deep multimodal learning. This involves the development of models capable of processing and analyzing the multimodal information uniformly. Unstructured real-world data can inherently take many forms, also known as modalities, often including visual and textual content. Extracting relevant patterns from this kind of data is still a motivating goal for researchers in deep learning. In this paper, we seek to improve the understanding of key concepts and algorithms of deep multimodal learning for the computer vision community by exploring how to generate deep models that consider the integration and combination of heterogeneous visual cues across sensory modalities. In particular, we summarize six perspectives from the current literature on deep multimodal learning, namely: multimodal data representation, multimodal fusion (i.e., both traditional and deep learning-based schemes), multitask learning, multimodal alignment, multimodal transfer learning, and zero-shot learning. We also survey current multimodal applications and present a collection of benchmark datasets for solving problems in various vision domains. Finally, we highlight the limitations and challenges of deep multimodal learning and provide insights and directions for future research.
... Human-Computer interaction is an interdisciplinary research area focusing on designing computational technologies to make the interaction between humans and computers possible. Hand gesture recognition is a sub-field of HCI in which computer vision and artificial intelligence help provide nonverbal communication between humans and computers by identifying significant movements of the human hands [1]. Though there are a variety of applications of hand gesture recognition, accurate recognition remains a challenging task [1]. ...
Article
Full-text available
Hearing-impaired people use sign language to express their thoughts and emotions and reinforce information delivered in daily conversations. Though they make up a significant percentage of any population, most people cannot interact with them due to limited or no knowledge of sign languages. Sign language recognition aims to detect the significant motions of the human body, especially the hands, and to analyze and understand them. Such systems may become life-saving when hearing-challenged people are in desperate situations such as heart attacks or accidents. In the present study, deep learning-based hand gesture recognition models are developed to accurately predict the emergency signs of Indian Sign Language (ISL). The dataset used contains videos for eight different emergency situations. Several frames were extracted from the videos and fed to three different models. Two models are designed for classification, while one is an object detection model applied after annotating the frames. The first model consists of a three-dimensional convolutional neural network (3D CNN), while the second comprises a pre-trained VGG-16 and a recurrent neural network with a long short-term memory (RNN-LSTM) scheme. The last model is based on YOLO (You Only Look Once) v5, an advanced object detection algorithm. The prediction accuracies of the classification models were 82% and 98%, respectively. The YOLO-based model outperformed the rest and achieved an impressive mean average precision of 99.6%.
... Gestures serve as a natural communication modality between humans and automated gesture recognition has recently gained interest due to its wide range of applications in human-computer interaction systems, sign language recognition, patient monitoring, security, entertainment, robotics, etc [1][2][3][4]. In general, hand gesture recognition can be broadly classified into three groups: 1) static 2) trajectory and 3) continuous recognition. ...
... Previously, we have reported spatio-temporal trajectory-based gesture recognition using 3D integral imaging and deep neural networks [3]. As compared to static and trajectory approaches, the continuous gesture case consists of a series of similar/dissimilar gestures rather than a single gesture [2,4,5]. Segmenting the individual gestures from the continuous video stream can be challenging since the gestures can happen in an arbitrary order and their duration is generally unknown [4]. Numerous approaches have been proposed for temporal segmentation and recognition of gestures from continuous gesture sequences, including the use of Dynamic Time Warping (DTW) [6], Hidden Markov Models [7], continuous dynamic programming [8], as well as approaches based on deep neural networks [9], etc. ...
Article
Full-text available
In this paper, we introduce a deep learning-based spatio-temporal continuous human gesture recognition algorithm for degraded conditions using three-dimensional (3D) integral imaging. The proposed system is shown to be an efficient continuous human gesture recognition system for degraded environments such as partial occlusion. In addition, we compare the performance of 3D integral imaging-based sensing and RGB-D sensing for continuous gesture recognition in degraded environments. Captured 3D data serves as the input to a You Only Look Once (YOLOv2) neural network for hand detection. Then, a temporal segmentation algorithm is employed to segment the individual gestures from a continuous video sequence. Following segmentation, the output is fed to a convolutional neural network-based bidirectional long short-term memory network (CNN-BiLSTM) for gesture classification. Our experimental results suggest that the proposed deep learning-based spatio-temporal continuous human gesture recognition provides substantial improvement over both RGB-D sensing and conventional 2D imaging systems. To the best of our knowledge, this is the first report of 3D integral imaging-based continuous human gesture recognition with deep learning and the first comparison between 3D integral imaging and RGB-D sensors for this task.
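The pipeline above splits a continuous stream into individual gestures before classification. The sketch below illustrates one simple, assumed way of doing temporal segmentation by thresholding frame-difference motion energy; it is not the paper's integral-imaging pipeline, which uses YOLOv2 hand detection and a CNN-BiLSTM classifier.

```python
# Illustrative sketch (not the paper's exact algorithm): splitting a continuous
# gesture stream into candidate segments by thresholding frame-to-frame motion
# energy, before each segment is passed to a classifier such as a CNN-BiLSTM.
import numpy as np

def segment_gestures(frames, threshold=5.0, min_len=8):
    """frames: (T, H, W) grayscale video; returns a list of (start, end) frame indices."""
    energy = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2))
    active = energy > threshold                        # frames with noticeable motion
    segments, start = [], None
    for t, flag in enumerate(active):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            if t - start >= min_len:
                segments.append((start, t))
            start = None
    if start is not None and len(active) - start >= min_len:
        segments.append((start, len(active)))
    return segments

print(segment_gestures(np.random.randint(0, 255, (120, 64, 64))))
```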
... Gesture recognition is an area in computer vision and machine learning that aims to analyze visual data and develop the machine's ability to understand it [34]. According to [34], studies since the 1980s have reported static gesture recognition tasks [28], body part detection [25], action and activity recognition [65] and sign language recognition (SLR). In the gesture recognition hierarchy, as presented by [31], SLR is considered the most important gesture type to be recognized. ...
Article
Full-text available
Sign language recognition is considered the most important and challenging application in gesture recognition, involving the fields of pattern recognition, machine learning and computer vision. This is mainly due to the complex visual–gestural nature of sign languages and the availability of few databases and studies related to automatic recognition. This work presents the development and validation of a Brazilian sign language (Libras) public database. The recording protocol describes (1) the chosen signs, (2) the signaller characteristics, (3) the sensors and software used for video acquisition, (4) the recording scenario and (5) the data structure. Provided that these steps are well defined, a database with more than 1000 videos of 20 Libras signs recorded by twelve different people is created using an RGB-D sensor and an RGB camera. Each sign was recorded five times by each signaller. This corresponds to a database with 1200 samples of the following data: (1) RGB video frames, (2) depth, (3) body points and (4) face information. Some approaches using deep learning-based models were applied to classify these signs based on 3D and 2D convolutional neural networks. The best result shows an average accuracy of 93.3%. This paper presents an important contribution for the research community by providing a publicly available sign language dataset and baseline results for comparison.
... Several approaches can be used to classify the sensed signals into recognized gestures. Classic machine-learning methodologies were based on a two-step procedure: first, extracting the relevant features from the input signals and, second, classifying those features into the corresponding gestures [14]. Humans associate gestures with meanings. ...
Preprint
Full-text available
Gesture recognition has become pervasive in many interactive environments. Recognition based on neural networks often reaches higher recognition rates than competing methods, at the cost of a higher computational complexity that becomes very challenging on low-resource computing platforms such as microcontrollers. New optimization methodologies, such as quantization and Neural Architecture Search (NAS), are steps forward for the development of embeddable networks. In addition, as neural networks are commonly used in a supervised fashion, labeling tends to introduce bias into the model. Unsupervised methods allow tasks such as classification to be performed without depending on labeling. In this work, we present an embedded and unsupervised gesture recognition system, composed of a neural network autoencoder and a K-Means clustering algorithm and optimized through a state-of-the-art multi-objective NAS. The presented method makes it possible to develop, deploy and perform unsupervised classification on low-resource embedded devices.
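A minimal sketch of the autoencoder-plus-K-Means idea described above follows, assuming flattened sensor windows as input; the dimensions, cluster count and training loop are illustrative and do not reflect the NAS-optimized network from the preprint.

```python
# Minimal sketch of the unsupervised pipeline described above: an autoencoder
# compresses raw gesture signals into a small embedding, and K-Means clusters
# the embeddings into gesture groups. Dimensions and cluster count are illustrative.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class GestureAutoencoder(nn.Module):
    def __init__(self, in_dim=96, code_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 32), nn.ReLU(), nn.Linear(32, in_dim))

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

model = GestureAutoencoder()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
data = torch.randn(256, 96)                 # stand-in for flattened gesture sensor windows

for _ in range(50):                         # short reconstruction training loop
    recon, _ = model(data)
    loss = nn.functional.mse_loss(recon, data)
    optim.zero_grad(); loss.backward(); optim.step()

with torch.no_grad():
    _, codes = model(data)
clusters = KMeans(n_clusters=5, n_init=10).fit_predict(codes.numpy())
print(clusters[:10])                        # unsupervised cluster assignments per sample
```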
... Therefore, the extraction of time-dimension information specific to video is not yet perfect. Because extracting time-dimension information requires processing more data and more complex models, effective gesture recognition is still very challenging [5]. Azar et al. [6] proposed a multi-stream convolutional neural network (CNN) framework. ...
Article
Full-text available
Gesture is a natural form of human communication, and it is of great significance in human–computer interaction. In dynamic gesture recognition methods based on deep learning, the key is to obtain comprehensive gesture feature information. Aiming at the problem of inadequate extraction of spatiotemporal features or loss of feature information in current dynamic gesture recognition, a new gesture recognition architecture is proposed, which combines a feature fusion network with a variant convolutional long short-term memory (ConvLSTM). The architecture extracts spatiotemporal feature information from local, global and deep aspects, and uses feature fusion to alleviate the loss of feature information. Firstly, local spatiotemporal feature information is extracted from the video sequence by a 3D residual network based on channel feature fusion. Then the authors use the variant ConvLSTM to learn the global spatiotemporal information of the dynamic gesture, and introduce an attention mechanism to change the gate structure of the ConvLSTM. Finally, a multi-feature fusion depthwise separable network is used to learn higher-level features including depth feature information. The proposed approach obtains very competitive performance on the Jester dataset with a classification accuracy of 95.59%, and achieves state-of-the-art performance with 99.65% accuracy on the SKIG (Sheffield Kinect Gesture) dataset.
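The ConvLSTM named in this abstract replaces the matrix multiplications of a standard LSTM cell with convolutions, so hidden states keep their spatial layout. Below is a hedged sketch of a plain ConvLSTM cell; the paper's attention-modified gate structure and feature-fusion network are not reproduced here, and all sizes are illustrative.

```python
# Hedged sketch of a single ConvLSTM cell, the recurrent building block named above;
# channel counts and kernel size are illustrative, not the paper's configuration.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hidden_ch, kernel=3):
        super().__init__()
        pad = kernel // 2
        # one convolution produces all four gates (input, forget, cell, output)
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch, kernel, padding=pad)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)          # convolutional cell-state update
        h = o * torch.tanh(c)                  # hidden state keeps its spatial layout
        return h, c

cell = ConvLSTMCell(in_ch=16, hidden_ch=32)
h = c = torch.zeros(1, 32, 28, 28)
for frame in torch.randn(8, 1, 16, 28, 28):    # unroll over 8 time steps
    h, c = cell(frame, (h, c))
print(h.shape)   # torch.Size([1, 32, 28, 28])
```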
... Because there are many factors affecting gesture recognition, such as changes in illumination, complex background environments, and the inconsistency between standard gestures and gestures in the real environment, effective video-based gesture recognition is extremely difficult [3]. Dynamic gestures are not only the accumulation of single-hand postures but also need to be combined with a series of movements of the arm and hands, so spatiotemporal information plays a great role in the correct recognition of gestures. Unlike behavior recognition, which may be aided by background information, hand gesture recognition should pay more attention to the movement of the arms and hands. ...
... S K and feed it into the 3D pooling layer. The temporal depth [T1, T2, T3] is [1, 3, 6] in the first TTL and [1, 3, 4] in the subsequent TTLs of T3D [5], respectively. ...
Article
Gesture recognition aims at understanding dynamic gestures of the human body and is one of the most important modes of human–computer interaction. To extract more effective spatiotemporal features from gesture videos for more accurate gesture classification, a novel feature extractor network, spatiotemporal attention 3D DenseNet, is proposed in this study. We extend DenseNet with 3D kernels and a Refined Temporal Transition Layer based on the Temporal Transition Layer, and we also explore attention mechanisms in 3D ConvNets. We embed the Refined Temporal Transition Layer and the attention mechanism in DenseNet3D and name the proposed network "spatiotemporal attention 3D DenseNet." Our experiments show that our Refined Temporal Transition Layer performs better than the Temporal Transition Layer, and the proposed spatiotemporal attention 3D DenseNet in each modality outperforms the current state-of-the-art methods on the ChaLearn LAP Large-Scale Isolated gesture dataset. The code and pretrained model are released at https://github.com/dzf19927/STA3D.
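The Temporal Transition Layer discussed above applies parallel 3D convolutions with different temporal kernel depths (e.g. [1, 3, 6]) and concatenates their outputs. The sketch below is an assumed illustration of that multi-branch idea, not the authors' released STA3D code (see their GitHub repository for the actual implementation).

```python
# Hedged sketch of a multi-branch temporal transition layer: parallel 3D convolutions
# with different temporal depths whose outputs are concatenated along channels.
# Channel counts and temporal depths are illustrative only.
import torch
import torch.nn as nn

class TemporalTransitionLayer(nn.Module):
    def __init__(self, in_ch, out_ch_per_branch, temporal_depths=(1, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv3d(in_ch, out_ch_per_branch,
                      kernel_size=(t, 3, 3),
                      padding=(t // 2, 1, 1))          # keep spatial size; roughly keep temporal size
            for t in temporal_depths
        ])

    def forward(self, x):                              # x: (batch, channels, time, H, W)
        feats = [b(x) for b in self.branches]
        t_min = min(f.shape[2] for f in feats)         # even temporal kernels shorten time; crop to match
        return torch.cat([f[:, :, :t_min] for f in feats], dim=1)

ttl = TemporalTransitionLayer(in_ch=64, out_ch_per_branch=32)
print(ttl(torch.randn(2, 64, 16, 28, 28)).shape)       # torch.Size([2, 96, 16, 28, 28])
```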