Received: 29 May 2019 Revised: 8 July 2019 Accepted: 4 October 2019
DOI: 10.1111/exsy.12490
SPECIAL ISSUE PAPER
Hand gesture recognition using multimodal data fusion and
multiscale parallel convolutional neural network for
human–robot interaction
Qing Gao1,2  Jinguo Liu1  Zhaojie Ju1,3
1State Key Laboratory of Robotics, Shenyang Institute of Automation, Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang, China
2University of Chinese Academy of Sciences, Beijing, China
3School of Computing, University of Portsmouth, Portsmouth, UK
Correspondence
Jinguo Liu, State Key Laboratory of Robotics, Shenyang Institute of Automation, Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang 110016, China.
Email: liujinguo@sia.cn
Funding information
CAS Interdisciplinary Innovation Team,
Grant/Award Number: JCTD-2018-11;
National Key R&D Program of China,
Grant/Award Number: 2018YFB1304600;
Natural Science Foundation of China,
Grant/Award Number: 51575412 and
51775541; the EU Seventh Framework
Programme (FP7)-ICT, Grant/Award Number:
611391
Abstract
Hand gesture recognition plays an important role in human–robot interaction. The accuracy and reliability of hand gesture recognition are the keys to gesture-based human–robot interaction tasks. To address this problem, a method based on multimodal data fusion and a multiscale parallel convolutional neural network (CNN) is proposed in this paper to improve the accuracy and reliability of hand gesture recognition. First, data fusion is conducted on the sEMG signal, the RGB image, and the depth image of hand gestures. Then, the fused images are downsampled to generate two images at different scales, which are respectively input into the two subnetworks of the parallel CNN to obtain two hand gesture recognition results. After that, the recognition results of the two subnetworks are combined to obtain the final hand gesture recognition result. Finally, experiments are carried out on a self-made database containing 10 common hand gestures, which verify the effectiveness and superiority of the proposed method for hand gesture recognition. In addition, the proposed method is applied to a seven-degree-of-freedom bionic manipulator to achieve robotic manipulation with hand gestures.
KEYWORDS
hand gesture recognition, multimodal data fusion, parallel CNN, sEMG signal
1 INTRODUCTION
In gesture-based space human–robot interaction (HRI; Malima et al., 2006; Raheja et al., 2010), reliability and security are the keys to ensuring the normal operation of HRI (Liu et al., 2016b). The traditional methods of acquiring hand gesture information mainly include collecting RGB images of hand gestures with a colour camera, collecting depth information with a depth sensor, and collecting electromyography information with an sEMG device. Each of these methods has its advantages and disadvantages. For example, the RGB image of a hand gesture has rich appearance features, but it cannot convey the 3D information of the gesture. The depth image contains 3D features, but it lacks sufficient appearance features. Moreover, neither the RGB image nor the depth image can be recognized reliably when the hand is severely occluded. Although an sEMG device does not suffer from the occlusion problem, its noise and interference are large, and it is often impossible to obtain a high hand gesture recognition accuracy with sEMG alone. By fusing multimodal gesture information, the various kinds of information about the hand gesture can be utilized and the recognition accuracy can be improved. For example, Chen et al. (2015) combine depth camera data and inertial sensor data for the recognition of 27 human body movements, which improves the recognition accuracy. Miao et al. (2017) combine RGB, depth, and optical flow information to identify dynamic gestures of the human upper body. Kopuklu et al. (2018) combine the colour map and the optical flow graph of dynamic hand gestures and achieve good results on the Jester, ChaLearn LAP IsoGD, and NVIDIA Dynamic Hand Gesture datasets. According to the stage at which fusion is performed, multimodal data fusion is mainly divided into data-level fusion (Kopuklu et al., 2018), feature-level fusion (Miao et al., 2017), and decision-level fusion (Simonyan & Zisserman, 2014). Among them, data-level fusion can achieve the highest fusion efficiency. It has
two advantages: (a) training only requires a single-channel network, and (b) pixel-wise correspondence between the different modalities is established automatically. Based on the above analysis, data-level fusion of RGB information, depth information, and sEMG information is proposed in this paper to improve the recognition accuracy of hand gestures.
At present, existing work has achieved great success in the field of image recognition (Gao et al., 2017a). In addition, the use of multiscale input in parallel networks can also effectively improve image recognition accuracy. For example, Karpathy et al. (2014) show that better experimental results can be obtained by a dual-stream convolutional neural network (CNN) fed with raw and spatially cropped video streams. In Molchanov et al. (2015), a parallel 3D CNN is designed in which the original data and the downsampled data are input into two parallel subnetworks, which improves dynamic hand gesture recognition accuracy. Taking these results into account, a parallel CNN structure is proposed in this paper, with the fused data and the downsampled data taken as inputs to the two parallel subnetworks. Finally, the outputs of the two subnetworks are combined to obtain the final hand gesture recognition result.
The contributions of this paper are summarized as follows.
(a) The RGB image, the depth image, and the sEMG signal of the hand gesture are combined to deal with hand gesture recognition in the case of hand occlusion and to improve the reliability and safety of gesture-based HRI.
(b) A data-level fusion method is designed to convert the sEMG signal into an image and then fuse it with the RGB image and the depth image.
(c) A multiscale parallel convolutional neural network (MPN) framework is designed to improve the recognition accuracy of hand gestures.
(d) A hand gesture database containing 10 common HRI hand gestures is built. This database contains RGB images, depth images, and sEMG information and can be used for verification of the proposed method.
(e) The proposed method is applied to the control of a seven-degree-of-freedom bionic manipulator to realize gesture-based space HRI.
The rest of this paper is structured as follows. Section 2 introduces the gesture-based space HRI system. Section 3 presents the data fusion method. Section 4 introduces the MPN framework. Section 5 reports the experimental results and discussion, and the concluding remarks and future work are given in Section 6.
2 GESTURE-BASED SPACE HRI SYSTEM
Security and reliability play important roles in space HRI tasks. This section focuses on a bionic manipulator mounted on an astronaut assistant robot (AAR; Gao et al., 2017b; Liu et al., 2016a) and on using hand gestures to control and operate it.
2.1 Space robot
In the space station, due to the heavy workload and the limited number of astronauts, the AAR is designed to assist the astronauts in completing some space missions. For example, it can assist astronauts in conducting space experiments, help astronauts fetch tools, and monitor the safety of astronauts. Therefore, we design an AAR for use in the space station cabin. As shown in Figure 1, its platform consists of an AAR, a bionic manipulator, and a simulated air bearing table.
2.1.1 AAR
This space robot is capable of free flight in the cabin and is equipped with 12 ducted fans as its drives for six-degree-of-freedom motion in
microgravity environments.
2.1.2 Air bearing table
In the ground experiment, the simulated air bearing table can help the AAR to realize simulated microgravity movement in the horizontal direction.
2.1.3 Bionic manipulator
A bionic manipulator is mounted on the AAR and can be used to grasp tools or objects. Its structure consists of fingers, a palm, and a wrist, with seven degrees of freedom in total (five in the fingers and two in the wrist). Controlling the motion of the bionic manipulator is extremely important. Traditional control devices, such as handles, joysticks, and consoles, are complicated and inconvenient to operate. Because the structure of the manipulator is similar to that of a human hand, it is very convenient to control it directly with hand gestures.
2.2 Gesture-based HRI
Hand gesture recognition technology is very important in gesture-based HRI (Gao et al., 2019). Current hand gesture recognition methods are mainly based on wearable devices or on vision (Rautaray & Agrawal, 2015; Smith et al., 2000). Both approaches have their pros and cons. For example, wearable-based hand gesture recognition is limited by the device and suffers from large interference, while vision-based hand gesture recognition is susceptible to occlusion. Since security and reliability are the priority in space missions, in this paper these two methods are combined to improve the recognition accuracy of hand gestures and to cope with various kinds of interference, such as occlusion and signal noise.
FIGURE 1 Astronaut assistant robot platform
FIGURE 2 The pipeline of the human–robot interaction method
For the acquisition of hand gesture signals, we use the MYO armband to collect the sEMG signals of hand gestures (Benalcázar et al., 2017). The sensors on the MYO armband capture the bioelectrical changes that occur when the user's arm muscles move, from which the wearer's hand gestures can be judged, and the data are sent via Bluetooth. The Kinect is used to collect the RGB and depth images of hand gestures (Ren et al., 2013). Kinect is a 3D sensor that includes a colour camera and a TOF depth camera, and it can capture hand gesture movements in 3D space in real time. The three kinds of information are then combined and transmitted to the hand gesture recognition model based on a deep neural network to identify the corresponding hand gesture. After that, the recognized result is transmitted to the AAR, thereby enabling the human hand to control the bionic manipulator so that it simulates the human hand gestures. The flow chart of this method is shown in Figure 2.
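To make the data flow concrete, the sketch below outlines one cycle of the interaction loop described above. All device and robot interfaces (read_rgbd, read_semg, send_command) are hypothetical placeholders, not real MYO or Kinect SDK calls, and recognize stands in for the recognition model of Section 4.

```python
# Minimal sketch of one interaction cycle; all I/O functions are placeholders.
import numpy as np

def read_semg():                      # placeholder: filtered 16 x 1000 sEMG window (MYO armband)
    return np.zeros((16, 1000), dtype=np.float32)

def read_rgbd():                      # placeholder: one RGB frame and one depth frame (Kinect)
    return np.zeros((480, 640, 3), np.uint8), np.zeros((480, 640), np.uint16)

def fuse(rgb, depth, semg):           # placeholder: data-level fusion of Section 3
    return np.zeros((160, 160, 5), np.uint8)

def recognize(fused):                 # placeholder: MPN classifier of Section 4, returns a gesture index
    return 0

def send_command(gesture_id):         # placeholder: link to the bionic manipulator on the AAR
    print(f"manipulator command: hand gesture {gesture_id + 1}")

def interaction_step():
    rgb, depth = read_rgbd()
    semg = read_semg()
    send_command(recognize(fuse(rgb, depth, semg)))
```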
2.3 HRI hand gesture dataset
Because the bionic manipulator has seven degrees of freedom, it can simulate the movements of the human fingers and wrist. Accordingly, 10 common static hand gestures are designed, involving either the fingers alone or the fingers and wrist together. These hand gestures and their semantics are shown in Table 1. Among these hand gestures, hand gesture 1 and hand gesture 2 control the initial action and the stop action of the manipulator. Hand gestures 3–6 control the movement direction of the manipulator, that is, left, right, upward, and downward. The manipulator can learn different ways of grasping from hand gestures 7–10, which helps it grasp different objects. By imitating these hand gestures, the manipulator can perform conventional directional motions and grasping operations.
2.4 Stability analysis of HRI
For a complete gesture-based HRI system, the stability of the system should also be considered. When the astronaut's hand gestures are incorrect, during transitions between different hand gestures, or when the system has just started and the hand gesture changes suddenly, the HRI system may become unstable. Therefore, when mapping the astronaut's hand gestures to the bionic manipulator, it is necessary to design a method that filters out unstable hand gesture information and uses only stable hand gesture output to control the manipulator.
TABLE 1 Human–robot interaction hand gesture dataset (hand gesture diagrams not reproduced here)

Hand gesture number    Hand gesture semantic
Hand gesture 1         Relax
Hand gesture 2         Fist
Hand gesture 3         Left
Hand gesture 4         Right
Hand gesture 5         Downward
Hand gesture 6         Upward
Hand gesture 7         Grab a cylinder
Hand gesture 8         Grab a ball
Hand gesture 9         Pinch
Hand gesture 10        Buckle
FIGURE 3 Hand gesture interaction flow chart
A finite state machine (FSM) is a mathematical model employed to represent a finite set of states and the transitions and actions between these states. It is commonly used, for example, in the parsing of programming languages. Gesture-based HRI can also be regarded as a language in which semantic commands are expressed through hand gestures. Therefore, it is well suited to model the semantics of different hand gestures with FSMs.
For each predefined hand gesture, an FSM model is created. As shown in Figure 3, in the process of interaction between the astronaut and the bionic manipulator, first, each frame of data containing a hand gesture is collected by the Kinect and the MYO armband. Second, the hand gesture recognition algorithm is utilized to recognize the type of the hand gesture, and the recognized hand gesture information is then input to the corresponding FSM model, which outputs the predefined hand gesture.
Taking hand gesture 1 as an example, the state transition relationship of its FSM model is established and illustrated in Figure 4. The FSM model of hand gesture 1 works as follows. The system starts in state 1 (the initial state), in which the bionic manipulator is not controlled. When the astronaut makes a hand gesture and the system recognizes it as "Hand gesture 1," the FSM moves to transition state 2 and clears counter 1 and counter 2; at this point, the bionic manipulator is still not controlled. If a hand gesture other than "Hand gesture 1" is recognized in this state, the FSM returns to the initial state 1. If "Hand gesture 1" is kept for more than five consecutive frames, the FSM enters working state 3. If "Hand gesture 1" is recognized in this state, the hand gesture is output to control the bionic manipulator.
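A minimal sketch of this state machine is given below, assuming one recognized gesture label per frame. The three state names and the five-frame threshold follow the description above; the exact counter handling in the paper's Figure 4 is simplified here.

```python
# Sketch of the "Hand gesture 1" FSM; state names follow the text above.
IDLE, TRANSITION, WORKING = 1, 2, 3

class Gesture1FSM:
    def __init__(self, target=1, hold_frames=5):
        self.state = IDLE
        self.target = target          # gesture this FSM instance is responsible for
        self.hold_frames = hold_frames
        self.counter = 0

    def step(self, recognized):
        """Feed one recognized gesture label per frame; return the gesture to output
        to the manipulator, or None while the FSM is not yet in the working state."""
        if self.state == IDLE:
            if recognized == self.target:
                self.state, self.counter = TRANSITION, 0
        elif self.state == TRANSITION:
            if recognized != self.target:
                self.state = IDLE                 # fall back to the initial state
            else:
                self.counter += 1
                if self.counter >= self.hold_frames:
                    self.state = WORKING          # gesture held for 5 consecutive frames
        if self.state == WORKING and recognized == self.target:
            return self.target                    # stable output controls the manipulator
        return None
```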
FIGURE 4 Finite state machine model diagram of "Hand gesture 1"
3 DATA FUSION METHOD
The data fusion method plays an important role in multimodal hand gesture recognition tasks (EL-SAYED, 2015; Liu et al., 2014). According to the stage at which fusion is performed, fusion methods can be divided into data-level fusion, feature-level fusion, and decision-level fusion (Kopuklu et al., 2018). In this paper, data-level fusion is utilized to fuse the RGB, depth, and sEMG signals of hand gestures. Compared with feature-level fusion and decision-level fusion, data-level fusion has the following advantages: (a) the features of the fused data can be extracted by a single-channel deep neural network, which effectively reduces the number of parameters and improves the speed of the algorithm, and (b) pixel-wise correspondence between the multimodal data is established, which improves data fusion efficiency. This section introduces the data fusion method for the RGB, depth, and sEMG signals of the 10 hand gestures described above.
3.1 Data correspondence
Fusion of multimodal data requires the fused data to be consistent in structure and sampling time. The RGB and depth images acquired by the Kinect are captured at 30 fps with a resolution of 640 × 480. The sEMG signal collected by the MYO armband has 16 channels and a sampling frequency of 1000 Hz (Boyali et al., 2015). The image data and the sEMG signal differ in both spatial structure and sampling frequency, so they cannot be fused directly. Therefore, it is necessary to convert these three kinds of data into a consistent structure before fusing them. The conversion process is as follows.
3.1.1 Convert sEMG signals into images
The data collected by the MYO armband in each second are filtered to obtain a matrix M, which indicates the strength of the myoelectric signal. The size of M is 16 × 1,000. It is converted to a grayscale image with a pixel size of 16 × 1,000 using Equation (1):

s(x, y) = 255 × (m(x, y) − m_min(x, y)) / (m_max(x, y) − m_min(x, y)),   (1)

where s(x, y) is the pixel value at coordinate (x, y) in the image S, with s(x, y) ∈ [0, 255], x ∈ [1, 1,000], y ∈ [1, 16]; m(x, y) is the value of the sEMG signal at coordinate (x, y) in the matrix M, with m(x, y) ∈ [m_min(x, y), m_max(x, y)]; and m_min(x, y) and m_max(x, y) are the minimum and maximum values in the matrix M.
3.1.2 Cut the image S
The image S is cut into several small images with a pixel size of 16 × 16. Then, 10 of these images are uniformly extracted and converted into images with a size of 160 × 160 by upsampling. That is, 10 frames of grayscale images with a pixel size of 160 × 160 are obtained from the MYO armband per second.
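The short NumPy sketch below illustrates Sections 3.1.1–3.1.2 under stated assumptions: M is the filtered 16 × 1,000 matrix for one second, the 10 patches are picked uniformly along the time axis, and upsampling is done by nearest-neighbour replication (the paper does not specify the interpolation method).

```python
# Sketch of Equation (1) plus the cut-and-upsample step; assumptions noted above.
import numpy as np

def semg_to_frames(M, n_frames=10, out_size=160):
    # Equation (1): min-max scale the sEMG amplitudes to [0, 255] grayscale values.
    S = (255.0 * (M - M.min()) / (M.max() - M.min())).astype(np.uint8)   # 16 x 1000
    n_patches = S.shape[1] // 16                  # 16x16 patches along the time axis
    picks = np.linspace(0, n_patches - 1, n_frames).astype(int)
    scale = out_size // 16                        # 16 -> 160 by a factor of 10
    frames = [np.kron(S[:, k * 16:(k + 1) * 16],  # nearest-neighbour upsampling
                      np.ones((scale, scale), np.uint8))
              for k in picks]
    return np.stack(frames)                       # shape (10, 160, 160)
```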
3.1.3 Sampling and cutting hand gesture images in RGB and depth images
Taking the RGB images as an example, 10 frames are uniformly sampled from the 30 frames acquired by the Kinect per second. These 10 frames are cropped to images containing the hand gesture with a pixel size of 160 × 160. That is, 10 RGB images with a pixel size of 160 × 160 are obtained every second from the Kinect colour camera. The depth images are processed in the same way, so 10 depth images with a pixel size of 160 × 160 are obtained every second from the Kinect depth camera.
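A brief sketch of this sampling-and-cropping step is given below. It assumes the hand region centre (cx, cy) is already known (e.g., from a hand detector) and lies far enough from the image border; the paper does not specify how the hand region is located.

```python
# Sketch of Section 3.1.3: sample 10 of 30 frames and crop a 160x160 hand region.
import numpy as np

def sample_and_crop(frames, cx, cy, n_frames=10, size=160):
    idx = np.linspace(0, len(frames) - 1, n_frames).astype(int)   # 10 of 30 frames
    half = size // 2
    return np.stack([frames[i][cy - half:cy + half, cx - half:cx + half]
                     for i in idx])                               # (10, 160, 160[, C])
```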
3.1.4 Guarantee the time consistency of the acquired signals
After the above processing, all three types of data are sampled every 100 ms. As shown in Figure 5, the hand gesture sEMG image, RGB image, and depth image acquired at the same time t are S_t, R_t, and D_t, respectively:

S_t ∈ R^(w×h×c_s),  R_t ∈ R^(w×h×c_rgb),  D_t ∈ R^(w×h×c_d),   (2)

where c_s = 1 is the channel number of the sEMG image, c_rgb = 3 is the channel number of the RGB image, and c_d = 1 is the channel number of the depth image.
FIGURE 5 Data fusion process
3.2 Data fusion
The data fusion process is illustrated in Figure 5. The depth image D_t and the sEMG image S_t are sequentially attached to the RGB image R_t as additional channels. The mapping is as follows:

Λ : (R^(w×h×c_rgb), R^(w×h×c_d), R^(w×h×c_s)) → R^(w×h×c_f),   (3)

F_t = Λ(R_t, D_t, S_t),   (4)

c_f = c_rgb + c_d + c_s,   (5)

where F_t is the fused image and c_f = 5 is its number of channels. The three types of data collected at the same time are converted into fused data by the mapping Λ. The fused hand gesture image F_t contains (a) the appearance features contained in the RGB channels; (b) the 3D space features contained in the depth channel; and (c) the myoelectric features contained in the sEMG channel. Finally, F_t is input as the fused data into the deep neural network.
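Interpreting Equations (3)–(5) as channel stacking, the fusion reduces to concatenating the three aligned images along the channel axis, as sketched below; the input shapes follow Section 3.1.

```python
# Sketch of the channel-level fusion in Equations (3)-(5): R_t, D_t, S_t -> F_t.
import numpy as np

def fuse(rgb, depth, semg):
    # rgb: (160, 160, 3), depth: (160, 160), semg: (160, 160), all uint8 and time-aligned
    return np.concatenate([rgb,
                           depth[..., np.newaxis],
                           semg[..., np.newaxis]], axis=-1)   # F_t: (160, 160, 5)
```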
4 MULTISCALE PARALLEL CONVOLUTIONAL NEURAL NETWORK (MPN)
Aiming at feature extraction and recognition of the fused hand gesture data, we propose an MPN framework. This section mainly introduces the hand gesture database, the network framework, and the training method.
4.1 Hand gesture database
Because there is currently no public hand gesture database containing RGB images, depth images, and sEMG signals, we created such a database containing the above 10 hand gestures. The Kinect sensor is used to capture the RGB and depth images of the hand gestures, the MYO armband is used to capture the sEMG signals, and the data processing method described above is used to convert all three kinds of data into images of size 160 × 160. Hand gesture data are collected from six subjects, with 625 images collected per subject for each hand gesture. In addition, the data of one of these subjects are collected with object occlusion. Therefore, our hand gesture database contains a total of 112,500 images, of which the numbers of RGB, depth, and sEMG images are each 37,500. Then, by combining the three kinds of images collected at the same time using the above data fusion method, 37,500 fused images with a size of 160 × 160 and a channel number of 5 are obtained.
4.2 Network framework
The proposed MPN framework is based on the following ideas. (a) In current CNN models for image feature extraction, the residual connection structure (He et al., 2016) makes it possible to train deeper networks. It is implemented using shortcut connections, and the residual mapping formula is

F(x) = H(x) − x,   (6)

where x is the identity mapping, H(x) is the optimal solution near the identity mapping, and F(x) is the residual mapping between the identity mapping and the optimal solution.
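For illustration, a generic residual building block with a shortcut connection can be sketched in Keras as below. This is not the paper's exact module: the filter counts, kernel sizes, and layer arrangement are illustrative assumptions, and the input x is assumed to already have the same number of channels as the block output so the shortcut addition is valid.

```python
# Sketch of a residual block: the branch learns F(x) and the shortcut restores H(x) = F(x) + x.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    fx = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    fx = layers.Conv2D(filters, 3, padding="same")(fx)     # residual mapping F(x)
    out = layers.Add()([x, fx])                             # shortcut: H(x) = F(x) + x
    return layers.Activation("relu")(out)
```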
FIGURE 6 Multiscale parallel convolutional neural network framework
(b) The inception structure (Szegedy et al., 2016) can avoid computational explosion and extract features at multiple scales, so we add these two structures to the MPN. The traditional Inception v4 and Inception-ResNet (Szegedy et al., 2016) models are mainly designed for large images (299 × 299). Because our data size is smaller (160 × 160), the structure of the Inception v4 network is redesigned for feature extraction on our hand gesture recognition database. (c) Inspired by Karpathy et al. (2014), fusing image information at different spatial scales can increase the recognition rate of images. Therefore, a parallel deep neural network structure is designed to fuse two kinds of image data with spatial scales of 160 × 160 and 80 × 80. The specific network framework is shown in Figure 6, and its function modules are presented in Figure 7.
As shown in Figure 6, the MPN framework is divided into two channels: a high-resolution network (HRN) and a low-resolution network (LRN). The inputs to the network are the fused data and the data downsampled from the fused data, and the output is a probability vector over the 10 HRI hand gestures. The fused data are downsampled to obtain data of size 80 × 80 × 5, and the two channels process the image data at these two spatial scales in parallel. The HRN mainly extracts and classifies hand gesture data with a spatial size of 160 × 160, and the LRN mainly extracts and classifies hand gesture data with a spatial size of 80 × 80. The structures of the Stem, Inception, and Reduction modules in the MPN are the same as those of the corresponding modules in Inception v4. P_H(C|x, W_H) is the classification result of the HRN, and P_L(C|x, W_L) is the classification result of the LRN. The two results are then fused element-wise to obtain the final hand gesture classification result P_F(C|x):

P_F(C|x) = P_H(C|x, W_H) ∗ P_L(C|x, W_L),   (7)

where P_F(C|x), P_H(C|x, W_H), P_L(C|x, W_L) ∈ R^(10×1). The predicted label c_h* corresponds to the maximum value of the vector P_F(C|x):

c_h* = argmax P_F(C|x).   (8)
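The fusion of the two branch outputs can be sketched as below. Here the element-wise operation of Equation (7) is interpreted as an element-wise product, the downsampling to 80 × 80 is done by simple subsampling (an assumption; the paper does not specify the method), and hrn and lrn stand for the two trained subnetworks returning 10-way probability vectors.

```python
# Sketch of Equations (7)-(8): element-wise fusion of the HRN and LRN outputs.
import numpy as np

def mpn_predict(fused, hrn, lrn):
    low_res = fused[::2, ::2, :]       # 160x160x5 -> 80x80x5 input for the LRN branch
    p_h = hrn(fused)                   # P_H(C | x, W_H), shape (10,)
    p_l = lrn(low_res)                 # P_L(C | x, W_L), shape (10,)
    p_f = p_h * p_l                    # element-wise fusion, Equation (7)
    return int(np.argmax(p_f)), p_f    # predicted label c_h*, Equation (8)
```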
FIGURE 7 Function modules. (a) Reduction-A, (b) Reduction-B, (c) Inception-A, (d) Inception-B, (e) Inception-C
4.3 Training
The negative log-likelihood (Norouzi et al., 2016) is selected as the loss function:

L(W, D_H) = −(1/|D_H|) Σ_{i=1}^{|D_H|} log P_F(C^(i) | x^(i), W),   (9)

where D_H is the hand gesture database, |D_H| is its size, and W denotes the network weights.
The training process uses gradient descent with momentum, and its iterative update is as follows:

V_{t+1} = μ V_t − α ∇L(W_t),   (10)

W_{t+1} = W_t + V_{t+1},   (11)

where t is the current number of iterations. The weight W_{t+1} at step t+1 depends on the weight W_t at step t and the weight increment V_{t+1} at step t+1. The value of V_{t+1} is updated as a linear combination of the previous value V_t and the negative gradient. α is the learning rate and μ is the momentum applied to the previous update. The values of α and μ need to be adjusted to obtain the best training results.
The learning rate α is adjusted by the step method (LeCun et al., 2015):

α = α_0 × γ^⌊t/s⌋,   (12)

where α_0 is the initial learning rate, γ is the adjustment parameter, and s is the iteration length for adjusting the learning rate. That is, when the current number of iterations reaches an integral multiple of s, the learning rate is adjusted.
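A minimal NumPy sketch of Equations (10)–(12) is given below. Here grad_fn is a placeholder for the gradient of the loss with respect to the weights; in practice this would be computed by back-propagation in TensorFlow (the framework used in Section 5), and the default hyperparameter values mirror those reported in Section 5.1.

```python
# Sketch of momentum gradient descent with a step-decayed learning rate.
import numpy as np

def train(W, grad_fn, steps=60000, alpha0=1e-3, gamma=0.1, s=20000, mu=0.9):
    V = np.zeros_like(W)
    for t in range(steps):
        alpha = alpha0 * gamma ** (t // s)   # Equation (12): step learning-rate decay
        V = mu * V - alpha * grad_fn(W)      # Equation (10): momentum update
        W = W + V                            # Equation (11): weight update
    return W
```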
5 EXPERIMENTAL RESULTS AND DISCUSSION
The proposed data fusion method and MPN method are verified on the above hand gesture database, which demonstrates the feasibility and superiority of the proposed methods.
5.1 Verification of data fusion method
The data collected from the subject with occlusion in the hand gesture database are used as the verification data, and the data collected from the other five subjects are used as the training data to verify the data fusion method. The RGB images, the depth images, the sEMG images, and the fused images of the hand gestures are separately trained and verified using the HRN, and the experimental results are compared.
The parameters of the training process are set as follows. The values of μ and α_0 are mainly based on experience: we set μ to 0.9 and α_0 to 0.001. The value of γ is set to 0.1, the learning rate iteration length s is set to 20,000, and the total number of training steps is 60,000. That is, when the number of training steps reaches 20,000 and 40,000, the learning rate becomes 0.0001 and 0.00001, respectively.
All experiments are performed on a GTX 1060 GPU with 6 GB of memory, and TensorFlow is selected as the deep learning framework.
TABLE 2 Accuracy and speed

Input data              Accuracy (%)    Speed (ms)    GPU
RGB                     55.07           19            GTX 1060
Depth                   47.26           18            GTX 1060
sEMG                    75.52           18            GTX 1060
RGB + Depth + sEMG      88.89           21            GTX 1060

Note: The bold values indicate the highest accuracy and the fastest speed.
FIGURE 8 The comparison of 10 hand gesture recognition accuracies with four different input data. The blue bars indicate the accuracy of hand gesture recognition obtained by inputting RGB images alone, the cyan bars the accuracy obtained by inputting depth images alone, the yellow bars the accuracy obtained by inputting sEMG images alone, and the red bars the accuracy obtained by inputting the fused images.
TABLE 3 Accuracy and speed

Network                                                   Accuracy (%)    Speed (ms)    GPU
High-resolution network (HRN)                             88.89           21            GTX 1060
Low-resolution network (LRN)                              87.32           16            GTX 1060
Multiscale parallel convolutional neural network (MPN)    92.45           32            GTX 1060

Note: The bold values indicate the highest accuracy and the fastest speed.
Through these experiments, the average accuracy and time cost of hand gesture recognition using the RGB, depth, and sEMG data alone and using the fused data are obtained, as shown in Table 2. The recognition accuracy is calculated as the ratio of correctly recognized hand gesture images to the total number of hand gesture images, and the average accuracy is the average of the 10 per-gesture accuracies. The accuracy comparison of the 10 hand gestures for these four kinds of input is shown in Figure 8.
It can be seen in Table 2 that, among the results for the four different inputs, the accuracies obtained using only RGB or only depth data are very low (RGB: 55.07%, depth: 47.26%). This is because all the hand gesture images used in training are unoccluded, whereas some of the hand gesture images in the verification data are partially or totally occluded. It can therefore be observed that, in the case of occlusion, vision alone is not sufficient for hand gesture recognition. The recognition accuracy using sEMG data alone is 75.52%, which indicates that the sEMG signal is less affected by hand gesture occlusion. However, as can be seen from Figure 8, the recognition accuracy of hand gesture 8 is only 6%, which may be because the sEMG signals of hand gesture 8 (grab a ball) and hand gesture 7 (grab a cylinder) are very close, so that most of the sEMG images of hand gesture 8 are recognized as hand gesture 7. These two hand gestures, however, differ greatly in the RGB and depth images and can be easily distinguished there. The recognition accuracy using the fused data is the highest, reaching 88.89%. As Figure 8 shows, the fused data give a high recognition accuracy for every hand gesture: the lowest is 69.38% for hand gesture 7 and the highest is 100% for hand gesture 6. Therefore, it is demonstrated that using the fused data can effectively improve the recognition accuracy of hand gestures. In addition, as can be observed in Table 2, recognition with the fused data is the slowest of the four inputs, taking 21 ms, but this still allows high real-time performance.
5.2 Verification of MPN method
To verify the superiority of the proposed MPN method, the HRN, LRN, and MPN are trained and verified on the fused hand gesture data, and the results are compared. To keep the comparison fair, all training parameters are set to the same values as above. The average accuracy and speed of hand gesture recognition are shown in Table 3, and the accuracy comparison of the 10 hand gestures for these three networks is shown in Figure 9.
As can be seen from Table 3, the MPN achieves the highest hand gesture recognition accuracy of 92.45%. Figure 9 shows that the recognition accuracies of all 10 hand gestures obtained by the MPN are high: the lowest is 80.85% for hand gesture 7, and the highest, for hand gesture 4 and hand gesture 6, is 100%. This demonstrates that the MPN method can effectively improve hand gesture recognition accuracy. In addition, it can be seen from Table 3 that the LRN method is the fastest, taking 16 ms, because the LRN has relatively few parameters. The MPN method is the slowest, taking 32 ms, but at this speed the system can still achieve real-time performance. In summary, the MPN method proposed in this paper can not only effectively improve the recognition accuracy of hand gestures but can also be applied to the real-time control system of the above-mentioned bionic manipulator.
FIGURE 9 The comparison of 10 hand gesture recognition accuracies with three different network frameworks. The grey bars indicate the accuracy of hand gesture recognition using only the HRN, the green bars the accuracy using only the LRN, and the purple bars the accuracy using the MPN. HRN, high-resolution network; LRN, low-resolution network; MPN, multiscale parallel convolutional neural network
FIGURE 10 Hand gesture input and manipulator motion state diagram. (a) Hand gesture 1, (b) Hand gesture 2, (c) Hand gesture 3, (d) Hand gesture 4, (e) Hand gesture 5, (f) Hand gesture 6, (g) Hand gesture 7, (h) Hand gesture 8, (i) Hand gesture 9, (j) Hand gesture 10
TABLE 4 Accuracies of the manipulator corresponding to the 10 hand gestures

Hand gesture number    1    2    3    4    5    6    7    8    9    10
Accuracy (%)           96   82   97   98   89   97   78   85   90   91
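As a quick check, the per-gesture accuracies in Table 4 average to (96 + 82 + 97 + 98 + 89 + 97 + 78 + 85 + 90 + 91) / 10 = 903 / 10 = 90.3%, matching the average manipulator operation accuracy discussed in Section 5.3.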
5.3 Verification of the gesture-based HRI method
Applying the above MPN method to the recognition of the 10 HRI hand gestures enables online recognition of these hand gestures. The recognized hand gesture is converted into an action instruction, which is then transmitted to the seven-degree-of-freedom bionic manipulator according to the FSM model proposed above; in this way, the bionic manipulator is controlled by hand gesture operation. The hand gesture inputs and the corresponding motion states of the bionic manipulator are shown in Figure 10.
In the real-time system, the movement of the bionic manipulator is controlled by hand gestures. Each of the above 10 hand gestures is collected 100 times by the Kinect and the MYO armband, and the manipulator response is recorded each time. The response accuracies of the manipulator corresponding to the 10 hand gestures are then obtained and are shown in Table 4.
It can be seen from Table 4 that the accuracies of the manipulator operation corresponding to the 10 HRI hand gestures are all greater than or equal to 78%. Among them, the accuracy of hand gesture 7 is the lowest (78%), which corresponds to the lowest recognition accuracy of hand gesture 7 in Figure 9. Hand gesture 4 has the highest accuracy of 98%, which corresponds to the highest recognition accuracy of hand gesture 4 in Figure 9. The average accuracy of the manipulator operation is 90.3%, which is 2.15% lower than the hand gesture recognition accuracy of 92.45% obtained by the MPN. This indicates that there is an error of about 2.15% from hand gesture recognition to the manipulator response, which is acceptable in practice. In addition, the experimental records show that almost all erroneous responses keep the last action, indicating that the proposed FSM method helps stabilize the manipulator interaction control. In this case, even if the hand gesture recognition is wrong, as long as the manipulator keeps its motion unchanged, it is not affected by the recognition error. Therefore, this paper demonstrates that the proposed hand gesture recognition method and HRI method are effective and well suited to gesture-based robot control.
6 CONCLUDING REMARKS AND FUTURE WORK
In this paper, focusing on gesture-based HRI for a bionic manipulator on the AAR, a method using data fusion and a multiscale parallel convolutional neural network is proposed to improve the recognition accuracy of 10 HRI hand gestures. The contributions and innovations of this paper are summarized as follows: (a) For the control of the seven-degree-of-freedom bionic manipulator, 10 commonly used HRI hand gestures are designed, and a corresponding hand gesture database is built for these 10 hand gestures; the database contains RGB, depth, and sEMG data. (b) A data fusion method is proposed to fuse RGB, depth, and sEMG signals of different scales and to make the three kinds of data consistent in size and sampling time. (c) A multiscale parallel convolutional neural network framework is proposed to improve the recognition accuracy of hand gestures.
In the next step, our research will address dynamic hand gestures. Although the recognition of dynamic hand gestures is more practical, the data fusion and recognition are also more difficult. In the future, the method proposed in this paper will be improved to make it applicable to the recognition of dynamic hand gestures.
ACKNOWLEDGEMENT
This work was supported by the National Key R&D Program of China (Grant 2018YFB1304600), the Natural Science Foundation of China (Grants 51775541 and 51575412), the CAS Interdisciplinary Innovation Team (Grant JCTD-2018-11), and the EU Seventh Framework Programme (FP7)-ICT (Grant 611391).
CONFLICT OF INTERESTS
The authors declare no conflict of interests.
ORCID
Qing Gao https://orcid.org/0000-0002-5695-7405
Jinguo Liu https://orcid.org/0000-0002-6790-6582
Zhaojie Ju https://orcid.org/0000-0002-9524-7609
REFERENCES
Benalcázar, M. E., Jaramillo, A. G., Zea, A., Páez, A., Andaluz, V. H., et al. (2017). Hand gesture recognition using machine learning and the myo armband. In
2017 25th European Signal Processing Conference (EUSIPCO), IEEE, pp. 1040–1044.
Boyali, A., Hashimoto, N., & Matsumoto, O. (2015). Hand posture and gesture recognition using MYO armband and spectral collaborative representation
based classification. In 2015 IEEE 4th Global Conference on Consumer Electronics (GCCE), IEEE, pp. 200–201.
Chen, C., Jafari, R., & Kehtarnavaz, N. (2015). UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In 2015 IEEE International Conference on Image Processing (ICIP), IEEE, pp. 168–172.
EL-SAYED, A. (2015). Multi-biometric systems: A state of the art survey and research directions. (IJACSA) International Journal of Advanced Computer Science and Applications, 6.
Gao, Q., Liu, J., Ju, Z., Li, Y., Zhang, T., & Zhang, L. (2017a). Static hand gesture recognition with parallel CNNs for space human-robot interaction. In International Conference on Intelligent Robotics and Applications, Springer, pp. 462–473.
Gao, Q., Liu, J., Ju, Z., & Zhang, X. (2019). Dual-hand detection for human-robot interaction by a parallel network based on hand detection and body pose
estimation. IEEE Transactions on Industrial Electronics.
Gao, Q., Liu, J., Tian, T., & Li, Y. (2017b). Free-flying dynamics and control of an astronaut assistant robot based on fuzzy sliding mode algorithm. Acta
Astronautica,138, 462–474.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732.
Kopuklu, O., Kose, N., & Rigoll, G. (2018). Motion fused frames: Data level fusion strategy for hand gesture recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2103–2111.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436.
Liu, K., Chen, C., Jafari, R., & Kehtarnavaz, N. (2014). Fusion of inertial and depth sensor data for robust hand gesture recognition. IEEE Sensors Journal,
14(6), 1898–1903.
Liu, J., Gao, Q., Liu, Z., & Li, Y. (2016a). Attitude control for astronaut assisted robot in the space station. International Journal of Control, Automation and
Systems,14(4), 1082–1095.
Liu, J., Luo, Y., & Ju, Z. (2016b). An interactive astronaut-robot system with gesture control. Computational Intelligence and Neuroscience, 2016.
Malima, A. K., Özgür, E., & Çetin, M. (2006). A fast algorithm for vision-based hand gesture recognition for robot control.
Miao, Q., Li, Y., Ouyang, W., Ma, Z., Xu, X., Shi, W., & Cao, X. (2017). Multimodal gesture recognition based on the ResC3D network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3047–3055.
Molchanov, P., Gupta, S., Kim, K., & Kautz, J. (2015). Hand gesture recognition with 3D convolutional neural networks. In Proceedings of the IEEE conference
on computer vision and pattern recognition workshops, pp. 1–7.
Norouzi, M., Bengio, S., Jaitly, N., Schuster, M., Wu, Y., Schuurmans, D., et al. (2016). Reward augmented maximum likelihood for neural structured
prediction. In Advances In Neural Information Processing Systems, pp. 1723–1731.
Raheja, J. L., Shyam, R., Kumar, U., & Prasad, P. B. (2010). Real-time robotic hand control using hand gestures. In 2010 Second International Conference on Machine Learning and Computing, IEEE, pp. 12–16.
Rautaray, S. S., & Agrawal, A. (2015). Vision based hand gesture recognition for human computer interaction: A survey. Artificial Intelligence Review, 43(1), 1–54.
Ren, Z., Yuan, J., Meng, J., & Zhang, Z. (2013). Robust part-based hand gesture recognition using Kinect sensor. IEEE Transactions on Multimedia, 15(5), 1110–1120.
Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pp. 568–576.
Smith, A. V. W., Sutherland, A. I., Lemoine, A., & Mcgrath, S. (2000). Hand gesture recognition system and method. US Patent 6,128,003.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 2818–2826.
AUTHOR BIOGRAPHIES
Qing Gao was born in Tangshan, China. He received his BS degree in automation from Electrical Engineering and Automation School,
Liaoning Technology University, China, in 2013. Currently, he is working towards his PhD degree in the State Key Laboratory of Robotics,
Shenyang Institute of Automation (SIA), Chinese Academy of Sciences (CAS), Shenyang, China. His research interests include space robot,
artificial intelligence, machine vision, and human–robot interaction. He has authored or coauthored over 10 publications in journals and
conference proceedings in above areas and received one outstanding paper award.
Jinguo Liu received his PhD degree in robotics from Shenyang Institute of Automation (SIA), Chinese Academy of Sciences (CAS) in 2007. His research interests include modular robots, rescue robots, space robots, and bio-inspired robots. Since January 2011, he has been a Full Professor with SIA, CAS. He has also held the position of Assistant Director of the State Key Laboratory of Robotics (China) since March 2008.
He has authored and coauthored over 80 papers and 30 patents in above areas. He is a member of IEEE, a Senior Member of China
Mechanical Engineering Society, and the lead guest editor of International Journal of Advances in Mechanical Engineering.
Zhaojie Ju received his BS in automatic control and his MS in intelligent robotics, both from Huazhong University of Science and Technology, China, in 2005 and 2007, respectively, and his PhD degree in intelligent robotics from the University of Portsmouth, UK, in 2010. His research interests include machine intelligence, pattern recognition, and their applications in human motion analysis, human–robot interaction and collaboration, and robot skill learning. He is currently a Senior Lecturer in the School of Computing, University of Portsmouth. He previously held research appointments at University College London and the University of Portsmouth, UK. He is an Associate Editor of the IEEE Transactions on Cybernetics. He has authored or coauthored over 100 publications in journals, book chapters, and conference proceedings and received four best paper awards.
How to cite this article: Gao Q, Liu J, Ju Z. Hand gesture recognition using multimodal data fusion and multiscale parallel convolutional
neural network for human–robot interaction. Expert Systems. 2020;e12490. https://doi.org/10.1111/exsy.12490