Received: 29 May 2019 Revised: 8 July 2019 Accepted: 4 October 2019
DOI: 10.1111/exsy.12490
SPECIAL ISSUE PAPER
Hand gesture recognition using multimodal data fusion and
multiscale parallel convolutional neural network for
human–robot interaction
Qing Gao1,2  Jinguo Liu1  Zhaojie Ju1,3
1State Key Laboratory of Robotics, Shenyang Institute of Automation, Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang, China
2University of Chinese Academy of Sciences, Beijing, China
3School of Computing, University of Portsmouth, Portsmouth, UK
Correspondence
Jinguo Liu, State Key Laboratory of Robotics, Shenyang Institute of Automation, Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang 110016, China.
Email: liujinguo@sia.cn
Funding information
CAS Interdisciplinary Innovation Team,
Grant/Award Number: JCTD-2018-11;
National Key R&D Program of China,
Grant/Award Number: 2018YFB1304600;
Natural Science Foundation of China,
Grant/Award Number: 51575412 and
51775541; the EU Seventh Framework
Programme (FP7)-ICT, Grant/Award Number:
611391
Abstract
Hand gesture recognition plays an important role in human–robot interaction. The accuracy and reliability of hand gesture recognition are the keys to gesture-based human–robot interaction tasks. To address this problem, a method based on multimodal data fusion and a multiscale parallel convolutional neural network (CNN) is proposed in this paper to improve the accuracy and reliability of hand gesture recognition. First, data fusion is conducted on the sEMG signal, the RGB image, and the depth image of hand gestures. Then, the fused images are downsampled to generate two images at different scales, which are respectively input into the two subnetworks of the parallel CNN to obtain two hand gesture recognition results. After that, the recognition results of the two subnetworks are combined to obtain the final hand gesture recognition result. Finally, experiments are carried out on a self-made database containing 10 common hand gestures, which verify the effectiveness and superiority of the proposed method for hand gesture recognition. In addition, the proposed method is applied to a seven-degree-of-freedom bionic manipulator to achieve robotic manipulation with hand gestures.
KEYWORDS
hand gesture recognition, multimodal data fusion, parallel CNN, sEMG signal
1 INTRODUCTION
In gesture-based space human–robot interaction (HRI; Malima et al., 2006; Raheja et al., 2010), reliability and security are the keys to ensuring the normal operation of HRI (Liu et al., 2016b). The traditional methods of acquiring hand gesture information mainly include collecting RGB images of hand gestures with a colour camera, collecting depth information with a depth sensor, and collecting electromyography information with an sEMG device. Each of these methods has its advantages and disadvantages. For example, the RGB image of a hand gesture has rich appearance features, but it cannot convey the 3D information of the gesture. The depth image contains 3D features, but it lacks sufficient appearance features. Moreover, neither the RGB image nor the depth image can be recognized reliably when the hand is severely occluded. Although an sEMG device does not suffer from the occlusion problem, its noise and interference are large, and it is often impossible to obtain a high hand gesture recognition accuracy with sEMG alone. By fusing multimodal gesture information, the various kinds of information about the hand gesture can be utilized and the recognition accuracy can be improved. For example, Chen et al. (2015) combine depth camera data and inertial sensor data for the recognition of 27 human body movements, which improves the recognition accuracy. Miao et al. (2017) combine RGB, depth, and optical flow information to identify dynamic gestures of the human upper body. Kopuklu et al. (2018) combine the colour map and the optical flow graph of dynamic hand gestures and achieve good results on the Jester, ChaLearn LAP IsoGD, and NVIDIA Dynamic Hand Gesture datasets. According to the stage at which fusion is performed, multimodal data fusion is mainly divided into data-level fusion (Kopuklu et al., 2018), feature-level fusion (Miao et al., 2017), and decision-level fusion (Simonyan & Zisserman, 2014). Among them, data-level fusion can achieve the highest fusion efficiency. It has
two advantages: (a) training only requires a single-channel network, and (b) pixel-wise correspondence between the different modalities is established automatically. Based on the above analysis, data-level fusion of RGB information, depth information, and sEMG information is proposed in this paper to improve the recognition accuracy of hand gestures.
At present, existing work has achieved great success in the field of image recognition (Gao et al., 2017a). In addition, the use of multiscale input in parallel networks can also effectively improve image recognition accuracy. For example, Karpathy et al. (2014) show that better experimental results can be obtained by a dual-stream convolutional neural network (CNN) fed with raw and spatially cropped video streams. In Molchanov et al. (2015), a parallel 3D CNN is designed in which the original data and the downsampled data are input into two parallel subnetworks, which improves dynamic hand gesture recognition accuracy. Taking these results into account, a parallel CNN structure is proposed in this paper, with the fused data and the downsampled data taken as inputs to the two parallel subnetworks. Finally, the outputs of the two subnetworks are combined to obtain the final hand gesture recognition result.
The contributions of this paper are summarized as follows.
(a) The RGB image, the depth image, and the sEMG signal of the hand gesture are combined to deal with hand gesture recognition in the case of hand occlusion and to improve the reliability and safety of gesture-based HRI.
(b) A data-level fusion method is designed to convert the sEMG signal into an image and then fuse it with the RGB image and the depth image.
(c) A multiscale parallel convolutional neural network (MPN) framework is designed to improve the recognition accuracy of hand gestures.
(d) A hand gesture database containing 10 common HRI hand gestures is built. This database contains RGB images, depth images, and sEMG information and can be used for verification of the proposed method.
(e) The proposed method is applied to the control of a seven-degree-of-freedom bionic manipulator to realize gesture-based space HRI.
The rest of this paper is structured as follows. Section 2 introduces the gesture-based space HRI system. Section 3 presents the data fusion method. Section 4 introduces the MPN framework. Section 5 reports the experimental results and discussion, and the concluding remarks and future work are given in Section 6.
2 GESTURE-BASED SPACE HRI SYSTEM
Security and reliability play important roles in space HRI tasks. This section focuses on a bionic manipulator mounted on an astronaut assistant robot (AAR; Gao et al., 2017b; Liu et al., 2016a) and on using hand gestures to control and operate it.
2.1 Space robot
In the space station, due to the heavy workload and the limited number of astronauts, the AAR is designed to assist the astronauts in completing some space missions. For example, it can assist astronauts in conducting space experiments, help astronauts fetch tools, and monitor the safety of astronauts. Therefore, we design an AAR for use in the space station cabin. As shown in Figure 1, its platform consists of an AAR, a bionic manipulator, and a simulated air bearing table.
2.1.1 AAR
This space robot is capable of free flight in the cabin and is equipped with 12 ducted fans as its drives for six-degree-of-freedom motion in
microgravity environments.
2.1.2 Air bearing table
In the ground experiment, the simulated air bearing table can help the AAR to realize simulated microgravity movement in the horizontal direction.
2.1.3 Bionic manipulator
A bionic manipulator is mounted on the AAR and can be used to grasp tools or objects. Its structure consists of fingers, a palm, and a wrist, with seven degrees of freedom in total (five in the fingers and two in the wrist). Controlling the motion of the bionic manipulator is extremely important. Traditional control devices, such as handles, joysticks, and consoles, are complicated and inconvenient to operate. Because the structure of the manipulator is similar to that of a human hand, it is very convenient to control it directly with hand gestures.
2.2 Gesture-based HRI
Hand gesture recognition technology is very important in gesture-based HRI (Gao et al., 2019). Current hand gesture recognition methods are mainly based on wearable devices or on vision (Rautaray & Agrawal, 2015; Smith et al., 2000). Both approaches have their pros and cons. For example, wearable-based hand gesture recognition is limited by the device and suffers from large interference, while vision-based hand gesture recognition is susceptible to occlusion. Since security and reliability are the priority in space missions, in this paper these two methods are combined to improve the recognition accuracy of hand gestures and to cope with various kinds of interference, such as occlusion and signal noise.
FIGURE 1 Astronaut assistant robot platform
FIGURE 2 The pipeline of the human–robot interaction method
For the acquisition of hand gesture signals, we use the MYO armband to collect the sEMG signals of hand gestures (Benalcázar et al., 2017). The sensors on the MYO armband capture the bioelectrical changes that occur when the user's arm muscles move, from which the wearer's hand gestures can be judged, and the data are sent via Bluetooth. The Kinect is used to collect the RGB and depth images of hand gestures (Ren et al., 2013). Kinect is a 3D sensor that includes a colour camera and a TOF depth camera, and it can capture hand gesture movements in 3D space in real time. The three kinds of information are then combined and transmitted to the hand gesture recognition model based on a deep neural network to identify the corresponding hand gesture. After that, the recognized result is transmitted to the AAR, thereby enabling the human hand to control the bionic manipulator so that it simulates the human hand gestures. The flow chart of this method is shown in Figure 2.
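To make the data flow concrete, the sketch below outlines one cycle of the interaction loop described above. All device and robot interfaces (read_rgbd, read_semg, send_command) are hypothetical placeholders, not real MYO or Kinect SDK calls, and recognize stands in for the recognition model of Section 4.

```python
# Minimal sketch of one interaction cycle; all I/O functions are placeholders.
import numpy as np

def read_semg():                      # placeholder: filtered 16 x 1000 sEMG window (MYO armband)
    return np.zeros((16, 1000), dtype=np.float32)

def read_rgbd():                      # placeholder: one RGB frame and one depth frame (Kinect)
    return np.zeros((480, 640, 3), np.uint8), np.zeros((480, 640), np.uint16)

def fuse(rgb, depth, semg):           # placeholder: data-level fusion of Section 3
    return np.zeros((160, 160, 5), np.uint8)

def recognize(fused):                 # placeholder: MPN classifier of Section 4, returns a gesture index
    return 0

def send_command(gesture_id):         # placeholder: link to the bionic manipulator on the AAR
    print(f"manipulator command: hand gesture {gesture_id + 1}")

def interaction_step():
    rgb, depth = read_rgbd()
    semg = read_semg()
    send_command(recognize(fuse(rgb, depth, semg)))
```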
2.3 HRI hand gesture dataset
Because the bionic manipulator has seven degrees of freedom, it can simulate the movements of the human fingers and wrist. Accordingly, 10 common static hand gestures are designed, involving either the fingers alone or the fingers and wrist together. These hand gestures and their semantics are shown in Table 1. Among these hand gestures, hand gesture 1 and hand gesture 2 control the initial action and the stop action of the manipulator. Hand gestures 3–6 control the movement direction of the manipulator, that is, left, right, upward, and downward. The manipulator can learn different ways of grasping from hand gestures 7–10, which helps it grasp different objects. By imitating these hand gestures, the manipulator can perform conventional directional motions and grasping operations.
2.4 Stability analysis of HRI
For a complete gesture-based HRI system, the stability of the system should also be considered. When the astronaut's hand gestures are incorrect, during transitions between different hand gestures, or when the system has just started and the hand gesture changes suddenly, the HRI system may become unstable. Therefore, when mapping the astronaut's hand gestures to the bionic manipulator, it is necessary to design a method that filters out unstable hand gesture information and uses only stable hand gesture output to control the manipulator.
TABLE 1 Human–robot interaction hand gesture dataset (hand gesture diagrams not reproduced here)

Hand gesture number    Hand gesture semantic
Hand gesture 1         Relax
Hand gesture 2         Fist
Hand gesture 3         Left
Hand gesture 4         Right
Hand gesture 5         Downward
Hand gesture 6         Upward
Hand gesture 7         Grab a cylinder
Hand gesture 8         Grab a ball
Hand gesture 9         Pinch
Hand gesture 10        Buckle
FIGURE 3 Hand gesture interaction flow chart
A finite state machine (FSM) is a mathematical model employed to represent a finite set of states and the transitions and actions between these states. It is commonly used, for example, in the parsing of programming languages. Gesture-based HRI can also be regarded as a language in which semantic commands are expressed through hand gestures. Therefore, it is well suited to model the semantics of different hand gestures with FSMs.
For each predefined hand gesture, an FSM model is created. As shown in Figure 3, in the process of interaction between the astronaut and the bionic manipulator, first, each frame of data containing a hand gesture is collected by the Kinect and the MYO armband. Second, the hand gesture recognition algorithm is utilized to recognize the type of the hand gesture, and the recognized hand gesture information is then input to the corresponding FSM model, which outputs the predefined hand gesture.
Taking hand gesture 1 as an example, the state transition relationship of its FSM model is established and illustrated in Figure 4. The FSM model of hand gesture 1 works as follows. The system starts in state 1 (the initial state), in which the bionic manipulator is not controlled. When the astronaut makes a hand gesture and the system recognizes it as "Hand gesture 1," the FSM moves to transition state 2 and clears counter 1 and counter 2; at this point, the bionic manipulator is still not controlled. If a hand gesture other than "Hand gesture 1" is recognized in this state, the FSM returns to the initial state 1. If "Hand gesture 1" is kept for more than five consecutive frames, the FSM enters working state 3. If "Hand gesture 1" is recognized in this state, the hand gesture is output to control the bionic manipulator.
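A minimal sketch of this state machine is given below, assuming one recognized gesture label per frame. The three state names and the five-frame threshold follow the description above; the exact counter handling in the paper's Figure 4 is simplified here.

```python
# Sketch of the "Hand gesture 1" FSM; state names follow the text above.
IDLE, TRANSITION, WORKING = 1, 2, 3

class Gesture1FSM:
    def __init__(self, target=1, hold_frames=5):
        self.state = IDLE
        self.target = target          # gesture this FSM instance is responsible for
        self.hold_frames = hold_frames
        self.counter = 0

    def step(self, recognized):
        """Feed one recognized gesture label per frame; return the gesture to output
        to the manipulator, or None while the FSM is not yet in the working state."""
        if self.state == IDLE:
            if recognized == self.target:
                self.state, self.counter = TRANSITION, 0
        elif self.state == TRANSITION:
            if recognized != self.target:
                self.state = IDLE                 # fall back to the initial state
            else:
                self.counter += 1
                if self.counter >= self.hold_frames:
                    self.state = WORKING          # gesture held for 5 consecutive frames
        if self.state == WORKING and recognized == self.target:
            return self.target                    # stable output controls the manipulator
        return None
```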
FIGURE 4 Finite state machine model diagram of "Hand gesture 1"
3 DATA FUSION METHOD
The data fusion method plays an important role in multimodal hand gesture recognition tasks (EL-SAYED, 2015; Liu et al., 2014). According to the stage at which fusion is performed, fusion methods can be divided into data-level fusion, feature-level fusion, and decision-level fusion (Kopuklu et al., 2018). In this paper, data-level fusion is utilized to fuse the RGB, depth, and sEMG signals of hand gestures. Compared with feature-level fusion and decision-level fusion, data-level fusion has the following advantages: (a) the features of the fused data can be extracted by a single-channel deep neural network, which effectively reduces the number of parameters and improves the speed of the algorithm, and (b) pixel-wise correspondence between the multimodal data is established, which improves data fusion efficiency. This section introduces the data fusion method for the RGB, depth, and sEMG signals of the 10 hand gestures described above.
3.1 Data correspondence
Fusion of multimodal data requires the fused data to be consistent in structure and sampling time. The RGB and depth images acquired by the Kinect are captured at 30 fps with a resolution of 640 × 480. The sEMG signal collected by the MYO armband has 16 channels and a sampling frequency of 1000 Hz (Boyali et al., 2015). The image data and the sEMG signal differ in both spatial structure and sampling frequency, so they cannot be fused directly. Therefore, it is necessary to convert these three kinds of data into a consistent structure before fusing them. The conversion process is as follows.
3.1.1 Convert sEMG signals into images
The data collected by the MYO armband in each second are filtered to obtain a matrix M, which indicates the strength of the myoelectric signal. The size of M is 16 × 1,000. It is converted to a grayscale image with a pixel size of 16 × 1,000 using Equation (1):

s(x, y) = 255 × (m(x, y) − m_min(x, y)) / (m_max(x, y) − m_min(x, y)),   (1)

where s(x, y) is the pixel value at coordinate (x, y) in the image S, with s(x, y) ∈ [0, 255], x ∈ [1, 1,000], y ∈ [1, 16]; m(x, y) is the value of the sEMG signal at coordinate (x, y) in the matrix M, with m(x, y) ∈ [m_min(x, y), m_max(x, y)]; and m_min(x, y) and m_max(x, y) are the minimum and maximum values in the matrix M.
3.1.2 Cut the image S
The image S is cut into several small images with a pixel size of 16 × 16. Then, 10 of these images are uniformly extracted and converted into images with a size of 160 × 160 by upsampling. That is, 10 frames of grayscale images with a pixel size of 160 × 160 are obtained from the MYO armband per second.
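The short NumPy sketch below illustrates Sections 3.1.1–3.1.2 under stated assumptions: M is the filtered 16 × 1,000 matrix for one second, the 10 patches are picked uniformly along the time axis, and upsampling is done by nearest-neighbour replication (the paper does not specify the interpolation method).

```python
# Sketch of Equation (1) plus the cut-and-upsample step; assumptions noted above.
import numpy as np

def semg_to_frames(M, n_frames=10, out_size=160):
    # Equation (1): min-max scale the sEMG amplitudes to [0, 255] grayscale values.
    S = (255.0 * (M - M.min()) / (M.max() - M.min())).astype(np.uint8)   # 16 x 1000
    n_patches = S.shape[1] // 16                  # 16x16 patches along the time axis
    picks = np.linspace(0, n_patches - 1, n_frames).astype(int)
    scale = out_size // 16                        # 16 -> 160 by a factor of 10
    frames = [np.kron(S[:, k * 16:(k + 1) * 16],  # nearest-neighbour upsampling
                      np.ones((scale, scale), np.uint8))
              for k in picks]
    return np.stack(frames)                       # shape (10, 160, 160)
```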
3.1.3 Sampling and cutting hand gesture images in RGB and depth images
Taking the RGB images as an example, 10 frames are uniformly sampled from the 30 frames acquired by the Kinect per second. These 10 frames are cropped to images containing the hand gesture with a pixel size of 160 × 160. That is, 10 RGB images with a pixel size of 160 × 160 are obtained every second from the Kinect colour camera. The depth images are processed in the same way, so 10 depth images with a pixel size of 160 × 160 are obtained every second from the Kinect depth camera.
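A brief sketch of this sampling-and-cropping step is given below. It assumes the hand region centre (cx, cy) is already known (e.g., from a hand detector) and lies far enough from the image border; the paper does not specify how the hand region is located.

```python
# Sketch of Section 3.1.3: sample 10 of 30 frames and crop a 160x160 hand region.
import numpy as np

def sample_and_crop(frames, cx, cy, n_frames=10, size=160):
    idx = np.linspace(0, len(frames) - 1, n_frames).astype(int)   # 10 of 30 frames
    half = size // 2
    return np.stack([frames[i][cy - half:cy + half, cx - half:cx + half]
                     for i in idx])                               # (10, 160, 160[, C])
```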
3.1.4 Guarantee the time consistency of the acquired signals
After the above processing, all three types of data are sampled every 100 ms. As shown in Figure 5, the hand gesture sEMG image, RGB image, and depth image acquired at the same time t are S_t, R_t, and D_t, respectively:

S_t ∈ R^(w×h×c_s),  R_t ∈ R^(w×h×c_rgb),  D_t ∈ R^(w×h×c_d),   (2)

where c_s = 1 is the channel number of the sEMG image, c_rgb = 3 is the channel number of the RGB image, and c_d = 1 is the channel number of the depth image.
FIGURE 5 Data fusion process
3.2 Data fusion
The data fusion process is illustrated in Figure 5. The depth image D_t and the sEMG image S_t are sequentially attached to the RGB image R_t as additional channels. The mapping is as follows:

Λ : (R^(w×h×c_rgb), R^(w×h×c_d), R^(w×h×c_s)) → R^(w×h×c_f),   (3)

F_t = Λ(R_t, D_t, S_t),   (4)

c_f = c_rgb + c_d + c_s,   (5)

where F_t is the fused image and c_f = 5 is its number of channels. The three types of data collected at the same time are converted into fused data by the mapping Λ. The fused hand gesture image F_t contains (a) the appearance features contained in the RGB channels; (b) the 3D space features contained in the depth channel; and (c) the myoelectric features contained in the sEMG channel. Finally, F_t is input as the fused data into the deep neural network.
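Interpreting Equations (3)–(5) as channel stacking, the fusion reduces to concatenating the three aligned images along the channel axis, as sketched below; the input shapes follow Section 3.1.

```python
# Sketch of the channel-level fusion in Equations (3)-(5): R_t, D_t, S_t -> F_t.
import numpy as np

def fuse(rgb, depth, semg):
    # rgb: (160, 160, 3), depth: (160, 160), semg: (160, 160), all uint8 and time-aligned
    return np.concatenate([rgb,
                           depth[..., np.newaxis],
                           semg[..., np.newaxis]], axis=-1)   # F_t: (160, 160, 5)
```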
4 MULTISCALE PARALLEL CONVOLUTIONAL NEURAL NETWORK (MPN)
Aiming at feature extraction and recognition of the fused hand gesture data, we propose an MPN framework. This section mainly introduces the hand gesture database, the network framework, and the training method.
4.1 Hand gesture database
Because there is currently no public hand gesture database containing RGB images, depth images, and sEMG signals, we created such a database containing the above 10 hand gestures. The Kinect sensor is used to capture the RGB and depth images of the hand gestures, the MYO armband is used to capture the sEMG signals, and the data processing method described above is used to convert all three kinds of data into images of size 160 × 160. Hand gesture data are collected from six subjects, with 625 images collected per subject for each hand gesture. In addition, the data of one of these subjects are collected with object occlusion. Therefore, our hand gesture database contains a total of 112,500 images, of which the numbers of RGB, depth, and sEMG images are each 37,500. Then, by combining the three kinds of images collected at the same time using the above data fusion method, 37,500 fused images with a size of 160 × 160 and a channel number of 5 are obtained.
4.2 Network framework
The proposed MPN framework is based on the following ideas. (a) In current CNN models for image feature extraction, the residual connection structure (He et al., 2016) makes it possible to train deeper networks. It is implemented using shortcut connections, and the residual mapping formula is

F(x) = H(x) − x,   (6)

where x is the identity mapping, H(x) is the optimal solution near the identity mapping, and F(x) is the residual mapping between the identity mapping and the optimal solution.
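For illustration, a generic residual building block with a shortcut connection can be sketched in Keras as below. This is not the paper's exact module: the filter counts, kernel sizes, and layer arrangement are illustrative assumptions, and the input x is assumed to already have the same number of channels as the block output so the shortcut addition is valid.

```python
# Sketch of a residual block: the branch learns F(x) and the shortcut restores H(x) = F(x) + x.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    fx = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    fx = layers.Conv2D(filters, 3, padding="same")(fx)     # residual mapping F(x)
    out = layers.Add()([x, fx])                             # shortcut: H(x) = F(x) + x
    return layers.Activation("relu")(out)
```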
FIGURE 6 Multiscale parallel convolutional neural network framework
(b) The inception structure (Szegedy et al., 2016) can avoid computational explosion and extract features at multiple scales, so we add these two structures to the MPN. The traditional Inception v4 and Inception-ResNet (Szegedy et al., 2016) models are mainly designed for large images (299 × 299). Because our data size is smaller (160 × 160), the structure of the Inception v4 network is redesigned for feature extraction on our hand gesture recognition database. (c) Inspired by Karpathy et al. (2014), fusing image information at different spatial scales can increase the recognition rate of images. Therefore, a parallel deep neural network structure is designed to fuse two kinds of image data with spatial scales of 160 × 160 and 80 × 80. The specific network framework is shown in Figure 6, and its function modules are presented in Figure 7.
As shown in Figure 6, the MPN framework is divided into two channels: a high-resolution network (HRN) and a low-resolution network (LRN). The inputs to the network are the fused data and the data downsampled from the fused data, and the output is a probability vector over the 10 HRI hand gestures. The fused data are downsampled to obtain data of size 80 × 80 × 5, and the two channels process the image data at these two spatial scales in parallel. The HRN mainly extracts and classifies hand gesture data with a spatial size of 160 × 160, and the LRN mainly extracts and classifies hand gesture data with a spatial size of 80 × 80. The structures of the Stem, Inception, and Reduction modules in the MPN are the same as those of the corresponding modules in Inception v4. P_H(C|x, W_H) is the classification result of the HRN, and P_L(C|x, W_L) is the classification result of the LRN. The two results are then fused element-wise to obtain the final hand gesture classification result P_F(C|x):

P_F(C|x) = P_H(C|x, W_H) ∗ P_L(C|x, W_L),   (7)

where P_F(C|x), P_H(C|x, W_H), P_L(C|x, W_L) ∈ R^(10×1). The predicted label c_h* corresponds to the maximum value of the vector P_F(C|x):

c_h* = argmax P_F(C|x).   (8)
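The fusion of the two branch outputs can be sketched as below. Here the element-wise operation of Equation (7) is interpreted as an element-wise product, the downsampling to 80 × 80 is done by simple subsampling (an assumption; the paper does not specify the method), and hrn and lrn stand for the two trained subnetworks returning 10-way probability vectors.

```python
# Sketch of Equations (7)-(8): element-wise fusion of the HRN and LRN outputs.
import numpy as np

def mpn_predict(fused, hrn, lrn):
    low_res = fused[::2, ::2, :]       # 160x160x5 -> 80x80x5 input for the LRN branch
    p_h = hrn(fused)                   # P_H(C | x, W_H), shape (10,)
    p_l = lrn(low_res)                 # P_L(C | x, W_L), shape (10,)
    p_f = p_h * p_l                    # element-wise fusion, Equation (7)
    return int(np.argmax(p_f)), p_f    # predicted label c_h*, Equation (8)
```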
FIGURE 7 Function modules. (a) Reduction-A, (b) Reduction-B, (c) Inception-A, (d) Inception-B, (e) Inception-C
4.3 Training
The negative log-likelihood (Norouzi et al., 2016) is selected as the loss function:

L(W, D_H) = −(1/|D_H|) Σ_{i=1}^{|D_H|} log P_F(C^(i) | x^(i), W),   (9)

where D_H is the hand gesture database, |D_H| is its size, and W denotes the network weights.
The training process uses gradient descent with momentum, and its iterative update is as follows:

V_{t+1} = μ V_t − α ∇L(W_t),   (10)

W_{t+1} = W_t + V_{t+1},   (11)

where t is the current number of iterations. The weight W_{t+1} at step t+1 depends on the weight W_t at step t and the weight increment V_{t+1} at step t+1. The value of V_{t+1} is updated as a linear combination of the previous value V_t and the negative gradient. α is the learning rate and μ is the momentum applied to the previous update. The values of α and μ need to be adjusted to obtain the best training results.
The learning rate α is adjusted by the step method (LeCun et al., 2015):

α = α_0 × γ^⌊t/s⌋,   (12)

where α_0 is the initial learning rate, γ is the adjustment parameter, and s is the iteration length for adjusting the learning rate. That is, when the current number of iterations reaches an integral multiple of s, the learning rate is adjusted.
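A minimal NumPy sketch of Equations (10)–(12) is given below. Here grad_fn is a placeholder for the gradient of the loss with respect to the weights; in practice this would be computed by back-propagation in TensorFlow (the framework used in Section 5), and the default hyperparameter values mirror those reported in Section 5.1.

```python
# Sketch of momentum gradient descent with a step-decayed learning rate.
import numpy as np

def train(W, grad_fn, steps=60000, alpha0=1e-3, gamma=0.1, s=20000, mu=0.9):
    V = np.zeros_like(W)
    for t in range(steps):
        alpha = alpha0 * gamma ** (t // s)   # Equation (12): step learning-rate decay
        V = mu * V - alpha * grad_fn(W)      # Equation (10): momentum update
        W = W + V                            # Equation (11): weight update
    return W
```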
5 EXPERIMENTAL RESULTS AND DISCUSSION
The proposed data fusion method and MPN method are verified on the above hand gesture database, which demonstrates the feasibility and superiority of the proposed methods.
5.1 Verification of data fusion method
The data collected from the subject with occlusion in the hand gesture database are used as the verification data, and the data collected from the other five subjects are used as the training data to verify the data fusion method. The RGB images, the depth images, the sEMG images, and the fused images of the hand gestures are separately trained and verified using the HRN, and the experimental results are compared.
The parameters of the training process are set as follows. The values of μ and α_0 are mainly based on experience: we set μ to 0.9 and α_0 to 0.001. The value of γ is set to 0.1, the learning rate iteration length s is set to 20,000, and the total number of training steps is 60,000. That is, when the number of training steps reaches 20,000 and 40,000, the learning rate becomes 0.0001 and 0.00001, respectively.
All experiments are performed on a GTX 1060 GPU with 6 GB of memory, and TensorFlow is selected as the deep learning framework.
TABLE 2 Accuracy and speed

Input data              Accuracy (%)    Speed (ms)    GPU
RGB                     55.07           19            GTX 1060
Depth                   47.26           18            GTX 1060
sEMG                    75.52           18            GTX 1060
RGB + Depth + sEMG      88.89           21            GTX 1060

Note: The bold values indicate the highest accuracy and the fastest speed.
FIGURE 8 The comparison of 10 hand gesture recognition accuracies with four different input data. The blue bars indicate the accuracy of hand gesture recognition obtained by inputting RGB images alone, the cyan bars the accuracy obtained by inputting depth images alone, the yellow bars the accuracy obtained by inputting sEMG images alone, and the red bars the accuracy obtained by inputting the fused images.
TABLE 3 Accuracy and speed

Network                                                   Accuracy (%)    Speed (ms)    GPU
High-resolution network (HRN)                             88.89           21            GTX 1060
Low-resolution network (LRN)                              87.32           16            GTX 1060
Multiscale parallel convolutional neural network (MPN)    92.45           32            GTX 1060

Note: The bold values indicate the highest accuracy and the fastest speed.
Through these experiments, the average accuracy and time cost of hand gesture recognition using the RGB, depth, and sEMG data alone and using the fused data are obtained, as shown in Table 2. The recognition accuracy is calculated as the ratio of correctly recognized hand gesture images to the total number of hand gesture images, and the average accuracy is the average of the 10 per-gesture accuracies. The accuracy comparison of the 10 hand gestures for these four kinds of input is shown in Figure 8.
It can be seen in Table 2 that, among the results for the four different inputs, the accuracies obtained using only RGB or only depth data are very low (RGB: 55.07%, depth: 47.26%). This is because all the hand gesture images used in training are unoccluded, whereas some of the hand gesture images in the verification data are partially or totally occluded. It can therefore be observed that, in the case of occlusion, vision alone is not sufficient for hand gesture recognition. The recognition accuracy using sEMG data alone is 75.52%, which indicates that the sEMG signal is less affected by hand gesture occlusion. However, as can be seen from Figure 8, the recognition accuracy of hand gesture 8 is only 6%, which may be because the sEMG signals of hand gesture 8 (grab a ball) and hand gesture 7 (grab a cylinder) are very close, so that most of the sEMG images of hand gesture 8 are recognized as hand gesture 7. These two hand gestures, however, differ greatly in the RGB and depth images and can be easily distinguished there. The recognition accuracy using the fused data is the highest, reaching 88.89%. As Figure 8 shows, the fused data give a high recognition accuracy for every hand gesture: the lowest is 69.38% for hand gesture 7 and the highest is 100% for hand gesture 6. Therefore, it is demonstrated that using the fused data can effectively improve the recognition accuracy of hand gestures. In addition, as can be observed in Table 2, recognition with the fused data is the slowest of the four inputs, taking 21 ms, but this still allows high real-time performance.
5.2 Verification of MPN method
To verify the superiority of the proposed MPN method, the HRN, LRN, and MPN are trained and verified on the fused hand gesture data, and the results are compared. To keep the comparison fair, all training parameters are set to the same values as above. The average accuracy and speed of hand gesture recognition are shown in Table 3, and the accuracy comparison of the 10 hand gestures for these three networks is shown in Figure 9.
As can be seen from Table 3, the MPN achieves the highest hand gesture recognition accuracy of 92.45%. Figure 9 shows that the recognition accuracies of all 10 hand gestures obtained by the MPN are high: the lowest is 80.85% for hand gesture 7, and the highest, for hand gesture 4 and hand gesture 6, is 100%. This demonstrates that the MPN method can effectively improve hand gesture recognition accuracy. In addition, it can be seen from Table 3 that the LRN method is the fastest, taking 16 ms, because the LRN has relatively few parameters. The MPN method is the slowest, taking 32 ms, but at this speed the system can still achieve real-time performance. In summary, the MPN method proposed in this paper can not only effectively improve the recognition accuracy of hand gestures but can also be applied to the real-time control system of the above-mentioned bionic manipulator.
FIGURE 9 The comparison of 10 hand gesture recognition accuracies with three different network frameworks. The grey bars indicate the accuracy of hand gesture recognition using only the HRN, the green bars the accuracy using only the LRN, and the purple bars the accuracy using the MPN. HRN, high-resolution network; LRN, low-resolution network; MPN, multiscale parallel convolutional neural network
FIGURE 10 Hand gesture input and manipulator motion state diagram. (a) Hand gesture 1, (b) Hand gesture 2, (c) Hand gesture 3, (d) Hand gesture 4, (e) Hand gesture 5, (f) Hand gesture 6, (g) Hand gesture 7, (h) Hand gesture 8, (i) Hand gesture 9, (j) Hand gesture 10
TABLE 4 Accuracies of the manipulator corresponding to the 10 hand gestures

Hand gesture number    1    2    3    4    5    6    7    8    9    10
Accuracy (%)           96   82   97   98   89   97   78   85   90   91
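As a quick check, the per-gesture accuracies in Table 4 average to (96 + 82 + 97 + 98 + 89 + 97 + 78 + 85 + 90 + 91) / 10 = 903 / 10 = 90.3%, matching the average manipulator operation accuracy discussed in Section 5.3.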
5.3 Verification of the gesture-based HRI method
Applying the above MPN method to the recognition of the 10 HRI hand gestures enables online recognition of these hand gestures. The recognized hand gesture is converted into an action instruction, which is then transmitted to the seven-degree-of-freedom bionic manipulator according to the FSM model proposed above; in this way, the bionic manipulator is controlled by hand gesture operation. The hand gesture inputs and the corresponding motion states of the bionic manipulator are shown in Figure 10.
In the real-time system, the movement of the bionic manipulator is controlled by hand gestures. Each of the above 10 hand gestures is collected 100 times by the Kinect and the MYO armband, and the manipulator response is recorded each time. The response accuracies of the manipulator corresponding to the 10 hand gestures are then obtained and are shown in Table 4.
It can be seen from Table 4 that the accuracies of the manipulator operation corresponding to the 10 HRI hand gestures are all greater than or equal to 78%. Among them, the accuracy of hand gesture 7 is the lowest (78%), which corresponds to the lowest recognition accuracy of hand gesture 7 in Figure 9. Hand gesture 4 has the highest accuracy of 98%, which corresponds to the highest recognition accuracy of hand gesture 4 in Figure 9. The average accuracy of the manipulator operation is 90.3%, which is 2.15% lower than the hand gesture recognition accuracy of 92.45% obtained by the MPN. This indicates that there is an error of about 2.15% from hand gesture recognition to the manipulator response, which is acceptable in practice. In addition, the experimental records show that almost all erroneous responses keep the last action, indicating that the proposed FSM method helps stabilize the manipulator interaction control. In this case, even if the hand gesture recognition is wrong, as long as the manipulator keeps its motion unchanged, it is not affected by the recognition error. Therefore, this paper demonstrates that the proposed hand gesture recognition method and HRI method are effective and well suited to gesture-based robot control.
6 CONCLUDING REMARKS AND FUTURE WORK
In this paper, focusing on gesture-based HRI for a bionic manipulator on the AAR, a method using data fusion and a multiscale parallel convolutional neural network is proposed to improve the recognition accuracy of 10 HRI hand gestures. The contributions and innovations of this paper are summarized as follows: (a) For the control of the seven-degree-of-freedom bionic manipulator, 10 commonly used HRI hand gestures are designed, and a corresponding hand gesture database is built for these 10 hand gestures; the database contains RGB, depth, and sEMG data. (b) A data fusion method is proposed to fuse RGB, depth, and sEMG signals of different scales and to make the three kinds of data consistent in size and sampling time. (c) A multiscale parallel convolutional neural network framework is proposed to improve the recognition accuracy of hand gestures.
In the next step, our research will address dynamic hand gestures. Although the recognition of dynamic hand gestures is more practical, the data fusion and recognition are also more difficult. In the future, the method proposed in this paper will be improved to make it applicable to the recognition of dynamic hand gestures.
ACKNOWLEDGEMENT
This work was supported by the National Key R&D Program of China (Grant 2018YFB1304600), the Natural Science Foundation of China (Grants 51775541 and 51575412), the CAS Interdisciplinary Innovation Team (Grant JCTD-2018-11), and the EU Seventh Framework Programme (FP7)-ICT (Grant 611391).
CONFLICT OF INTERESTS
The authors declare no conflict of interests.
ORCID
Qing Gao https://orcid.org/0000-0002-5695-7405
Jinguo Liu https://orcid.org/0000-0002-6790-6582
Zhaojie Ju https://orcid.org/0000-0002-9524-7609
REFERENCES
Benalcázar, M. E., Jaramillo, A. G., Zea, A., Páez, A., Andaluz, V. H., et al. (2017). Hand gesture recognition using machine learning and the myo armband. In
2017 25th European Signal Processing Conference (EUSIPCO), IEEE, pp. 1040–1044.
Boyali, A., Hashimoto, N., & Matsumoto, O. (2015). Hand posture and gesture recognition using MYO armband and spectral collaborative representation
based classification. In 2015 IEEE 4th Global Conference on Consumer Electronics (GCCE), IEEE, pp. 200–201.
Chen, C., Jafari, R., & Kehtarnavaz, N. (2015). UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In 2015 IEEE International Conference on Image Processing (ICIP), IEEE, pp. 168–172.
EL-SAYED, A. (2015). Multi-biometric systems: A state of the art survey and research directions. (IJACSA) International Journal of Advanced Computer Science and Applications, 6.
Gao, Q., Liu, J., Ju, Z., Li, Y., Zhang, T., & Zhang, L. (2017a). Static hand gesture recognition with parallel CNNs for space human-robot interaction. In International Conference on Intelligent Robotics and Applications, Springer, pp. 462–473.
Gao, Q., Liu, J., Ju, Z., & Zhang, X. (2019). Dual-hand detection for human-robot interaction by a parallel network based on hand detection and body pose
estimation. IEEE Transactions on Industrial Electronics.
Gao, Q., Liu, J., Tian, T., & Li, Y. (2017b). Free-flying dynamics and control of an astronaut assistant robot based on fuzzy sliding mode algorithm. Acta
Astronautica,138, 462–474.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732.
Kopuklu, O., Kose, N., & Rigoll, G. (2018). Motion fused frames: Data level fusion strategy for hand gesture recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2103–2111.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436.
Liu, K., Chen, C., Jafari, R., & Kehtarnavaz, N. (2014). Fusion of inertial and depth sensor data for robust hand gesture recognition. IEEE Sensors Journal,
14(6), 1898–1903.
Liu, J., Gao, Q., Liu, Z., & Li, Y. (2016a). Attitude control for astronaut assisted robot in the space station. International Journal of Control, Automation and
Systems,14(4), 1082–1095.
Liu, J., Luo, Y., & Ju, Z. (2016b). An interactive astronaut-robot system with gesture control. Computational Intelligence and Neuroscience, 2016.
Malima, A. K., Özgür, E., & Çetin, M. (2006). A fast algorithm for vision-based hand gesture recognition for robot control.
Miao, Q., Li, Y., Ouyang, W., Ma, Z., Xu, X., Shi, W., & Cao, X. (2017). Multimodal gesture recognition based on the ResC3D network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3047–3055.
Molchanov, P., Gupta, S., Kim, K., & Kautz, J. (2015). Hand gesture recognition with 3D convolutional neural networks. In Proceedings of the IEEE conference
on computer vision and pattern recognition workshops, pp. 1–7.
Norouzi, M., Bengio, S., Jaitly, N., Schuster, M., Wu, Y., Schuurmans, D., et al. (2016). Reward augmented maximum likelihood for neural structured
prediction. In Advances In Neural Information Processing Systems, pp. 1723–1731.
Raheja, J. L., Shyam, R., Kumar, U., & Prasad, P. B. (2010). Real-time robotic hand control using hand gestures. In 2010 Second International Conference on Machine Learning and Computing, IEEE, pp. 12–16.
Rautaray, S. S., & Agrawal, A. (2015). Vision based hand gesture recognition for human computer interaction: A survey. Artificial Intelligence Review, 43(1), 1–54.
Ren, Z., Yuan, J., Meng, J., & Zhang, Z. (2013). Robust part-based hand gesture recognition using Kinect sensor. IEEE Transactions on Multimedia, 15(5), 1110–1120.
Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pp. 568–576.
Smith, A. V. W., Sutherland, A. I., Lemoine, A., & Mcgrath, S. (2000). Hand gesture recognition system and method. US Patent 6,128,003.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 2818–2826.
AUTHOR BIOGRAPHIES
Qing Gao was born in Tangshan, China. He received his BS degree in automation from Electrical Engineering and Automation School,
Liaoning Technology University, China, in 2013. Currently, he is working towards his PhD degree in the State Key Laboratory of Robotics,
Shenyang Institute of Automation (SIA), Chinese Academy of Sciences (CAS), Shenyang, China. His research interests include space robot,
artificial intelligence, machine vision, and human–robot interaction. He has authored or coauthored over 10 publications in journals and
conference proceedings in above areas and received one outstanding paper award.
Jinguo Liu received his PhD degree in robotics from Shenyang Institute of Automation (SIA), Chinese Academy of Sciences (CAS) in 2007. His research interests include modular robots, rescue robots, space robots, and bio-inspired robots. Since January 2011, he has been a Full Professor with SIA, CAS. He has also held the position of Assistant Director of the State Key Laboratory of Robotics (China) since March 2008.
He has authored and coauthored over 80 papers and 30 patents in above areas. He is a member of IEEE, a Senior Member of China
Mechanical Engineering Society, and the lead guest editor of International Journal of Advances in Mechanical Engineering.
Zhaojie Ju received his BS in automatic control and his MS in intelligent robotics, both from Huazhong University of Science and Technology, China, in 2005 and 2007, respectively, and his PhD degree in intelligent robotics from the University of Portsmouth, UK, in 2010. His research interests include machine intelligence, pattern recognition, and their applications in human motion analysis, human–robot interaction and collaboration, and robot skill learning. He is currently a Senior Lecturer in the School of Computing, University of Portsmouth. He previously held research appointments at University College London and the University of Portsmouth, UK. He is an Associate Editor of the IEEE Transactions on Cybernetics. He has authored or coauthored over 100 publications in journals, book chapters, and conference proceedings and received four best paper awards.
How to cite this article: Gao Q, Liu J, Ju Z. Hand gesture recognition using multimodal data fusion and multiscale parallel convolutional
neural network for human–robot interaction. Expert Systems. 2020;e12490. https://doi.org/10.1111/exsy.12490