Eyemotion visual schematic. A: A user wearing the VR HMD used for expression tracking (Note that no external camera is used in our method; this is just for visualization). B: Interior of the HMD, with IR LEDs visible around the radius of the eyepieces, highlighted with red circles. C: Captured eye data. D: Model inference with dynamic avatar representation. 

Source publication
Article
Full-text available
One of the main challenges of social interaction in virtual reality settings is that head-mounted displays occlude a large portion of the face, blocking facial expressions and thereby restricting social engagement cues among users. Hence, auxiliary means of sensing and conveying these expressions are needed. We present an algorithm to automatically...

Contexts in source publication

Context 1
... model classifies user expressions using only limited periocular eye image data, as shown in Fig. 1C, which is further limited by the large amount of intra-class variation among users. To account for limited data, a new type of data, and large variations, we turn to deep learning techniques. Recently, convolutional neural networks (CNNs) [17,14,32] have performed very well on image classification tasks and are pervasive in machine learning and computer vision. Additionally, deep learning methods have the benefit of learning important invariant features and embeddings without requiring any hand-crafted feature representations. Deep learning has also been shown to give state-of-the-art results on faces [19,31]. Our approach, based on deep learning, outperforms normal human accuracy and even advanced (trained users) human accuracy for categorizing select facial expressions from our dataset of only IR eye images. Human ratings form the primary baseline for our work. We use these ratings for comparison and evaluation, but not as labels during ...
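As a rough illustration of the CNN-based approach this context describes, the sketch below (PyTorch) maps single-channel IR periocular crops to expression logits. The 64x64 input size, channel widths, and five-class output are illustrative assumptions, not the paper's actual architecture.

import torch
import torch.nn as nn

class EyeExpressionCNN(nn.Module):
    """Small CNN over IR eye crops; a sketch, not the Eyemotion model."""
    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                          # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                          # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                  # global average pooling
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, 64, 64) IR eye crops with values in [0, 1]
        return self.classifier(self.features(x).flatten(1))

model = EyeExpressionCNN(num_classes=5)
logits = model(torch.rand(8, 1, 64, 64))   # dummy batch of IR crops
print(logits.shape)                        # torch.Size([8, 5])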
Context 2
... classification of expressions has been a well-studied topic in computer vision. However, most of this work is aimed at classification with a fixed front-facing camera that relies on seeing the full face of a user [13,20,28]. We focus on a more challenging scenario with no fixed external camera, where the user is wearing a head-mounted display (HMD) in a VR setting as shown in (Fig. ...
Context 3
... are motivated by the recent availability of commercial HMDs with eye-tracking cameras [1]. To avoid interference with the VR display, infrared cameras are mounted behind the lens and point at the eyes and surrounding areas (Fig. 1B). The images are typically used for eye-gaze estimation and for applications such as foveated rendering [26], but in our work we use the same input images for expression classification. A key aspect of our work is a labeling-based approach (as opposed to visual tracking of facial features) to identify expressions and action units only from these limited field-of-view eye ...

Similar publications

Thesis
Full-text available
Electric powered wheelchairs (FRE, fauteuils roulants électriques) have allowed many people with motor disabilities to regain satisfactory mobility, which has improved their quality of life by opening up a wide range of activities to them. However, when prescribing an FRE, or during a driver-training phase, it is necessary to assess the...

Citations

... This platform includes integrated 9DoF accelerometer-gyroscope and photoplethysmograph pulse rate sensors and hence provides movement, posture and state analysis. Hickson et al. presented an algorithm to automatically infer expressions by analyzing only a partially occluded face while the user was engaged in a VR experience (Hickson et al. 2019). They considered that images of the user's eyes captured from an IR gaze-tracking camera within an HMD are sufficient to infer a subset of facial expressions without the use of any fixed external camera. ...
Article
Full-text available
Facial expression recognition (FER) is an important method to study and distinguish human emotions. In the virtual reality (VR) context, people's emotions are instantly and naturally triggered and mobilized due to the high immersion and realism of VR. However, when people are wearing head-mounted display (HMD) VR equipment, the eye regions are covered. FER accuracy is reduced if the eye region information is discarded. Therefore, it is necessary to obtain the information of the eye regions by other means. The main difficulty of FER in an immersive VR context is that conventional FER methods depend on public databases. The facial image information in the public databases is complete, so these methods are difficult to apply directly in the VR context. To solve this problem, this paper designs and implements a solution for FER in the VR context as follows. A real facial expression database collection scheme in the VR context is implemented by adding an infrared camera and infrared light source to the HMD. A virtual database construction method is presented for FER in the VR context, which can improve the generalization of models. A deep network named the multi-region facial expression recognition model is designed for FER in the VR context.
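The multi-region model is only named in this abstract; as a hedged sketch of the general idea, the code below (PyTorch) runs a separate small CNN branch per facial region (here, two eye crops) and concatenates the branch features before classification. The region choices, crop sizes, and layer widths are assumptions.

import torch
import torch.nn as nn

def region_branch() -> nn.Sequential:
    # Per-region feature extractor; produces a 32-dim vector per crop.
    return nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

class MultiRegionFER(nn.Module):
    def __init__(self, num_regions: int = 2, num_classes: int = 6):
        super().__init__()
        self.branches = nn.ModuleList([region_branch() for _ in range(num_regions)])
        self.head = nn.Linear(32 * num_regions, num_classes)

    def forward(self, regions):
        # regions: one (batch, 1, H, W) grayscale crop per facial region
        feats = [branch(crop) for branch, crop in zip(self.branches, regions)]
        return self.head(torch.cat(feats, dim=1))

model = MultiRegionFER()
out = model([torch.rand(4, 1, 48, 48), torch.rand(4, 1, 48, 48)])
print(out.shape)   # torch.Size([4, 6])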
... Magagna [1] illustrated that a speaker's facial expressions can account for 55% of the interpretation in conversations, while the verbal component (i.e., the words) and the vocal component (i.e., the sound) contribute only 7% and 38%, respectively [1]. There is also a wide range of applications of facial expression recognition in human-computer interaction [2], especially for virtual reality and augmented reality systems [3,4], and healthcare systems (e.g., facial nerve grading [5,6]). ...
Article
Facial expression recognition plays an essential role in human conversation and human–computer interaction. Previous research studies have recognized facial expressions mainly based on 2D image processing requiring sensitive feature engineering and conventional machine learning approaches. The purpose of the present study was to recognize facial expressions by applying a new class of deep learning called geometric deep learning directly on 3D point cloud data. Two databases (Bosphorus and SIAT-3DFE) were used. The Bosphorus database includes sixty-five subjects with seven basic expressions (i.e., anger, disgust, fear, happiness, sadness, surprise, and neutral). The SIAT-3DFE database has 150 subjects and 4 basic facial expressions (neutral, happiness, sadness, and surprise). First, preprocessing procedures such as face center cropping, data augmentation, and point cloud denoising were applied to the 3D face scans. Then, a geometric deep learning model called PointNet++ was applied. A hyperparameter tuning process was performed to find the optimal model parameters. Finally, the developed model was evaluated using the recognition rate and confusion matrix. The facial expression recognition accuracy on the Bosphorus database was 69.01% for 7 expressions and reached 85.85% when recognizing five specific expressions (anger, disgust, happiness, surprise, and neutral). The recognition rate was 78.70% on the SIAT-3DFE database. The present study suggests that 3D point clouds can be processed directly for facial expression recognition using a geometric deep learning approach. In future work, the developed model will be applied to facial palsy patients to guide and optimize the functional rehabilitation program.
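The evaluation step mentioned above (recognition rate and confusion matrix) can be illustrated with scikit-learn. The class names and toy predictions below are placeholders; in practice the predictions would come from the trained PointNet++-style model.

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

class_names = ["anger", "disgust", "happiness", "surprise", "neutral"]
y_true = np.array([0, 1, 2, 3, 4, 2, 2, 0])   # ground-truth class indices
y_pred = np.array([0, 1, 2, 3, 3, 2, 2, 1])   # toy model predictions

print("recognition rate:", accuracy_score(y_true, y_pred))
cm = confusion_matrix(y_true, y_pred, labels=list(range(len(class_names))))
for name, row in zip(class_names, cm):
    print(f"{name:>10}: {row}")   # rows = true class, columns = predicted class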
... In [31], facial action unit intensity is estimated in a self-supervised manner by utilizing a differentiable rendering layer to fit the expression and retarget it to the character. In contrast, expression transfer from a VR headset [23,10,12,18] is more challenging due to the partial visibility of the face in HMC images, the specific hardware, and limited existing data. ...
Preprint
Full-text available
Social presence, the feeling of being there with a real person, will fuel the next generation of communication systems driven by digital humans in virtual reality (VR). The best 3D video-realistic VR avatars that minimize the uncanny effect rely on person-specific (PS) models. However, these PS models are time-consuming to build and are typically trained with limited data variability, which results in poor generalization and robustness. Major sources of variability that affects the accuracy of facial expression transfer algorithms include using different VR headsets (e.g., camera configuration, slop of the headset), facial appearance changes over time (e.g., beard, make-up), and environmental factors (e.g., lighting, backgrounds). This is a major drawback for the scalability of these models in VR. This paper makes progress in overcoming these limitations by proposing an end-to-end multi-identity architecture (MIA) trained with specialized augmentation strategies. MIA drives the shape component of the avatar from three cameras in the VR headset (two eyes, one mouth), in untrained subjects, using minimal personalized information (i.e., neutral 3D mesh shape). Similarly, if the PS texture decoder is available, MIA is able to drive the full avatar (shape+texture) robustly outperforming PS models in challenging scenarios. Our key contribution to improve robustness and generalization, is that our method implicitly decouples, in an unsupervised manner, the facial expression from nuisance factors (e.g., headset, environment, facial appearance). We demonstrate the superior performance and robustness of the proposed method versus state-of-the-art PS approaches in a variety of experiments.
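The specialized augmentation strategies are only named, not detailed, in this abstract. Purely as a hedged illustration of nuisance-factor augmentation in that spirit, the snippet below applies random brightness and contrast changes plus a small pixel shift (to mimic headset slippage) to a headset-camera crop; all ranges are arbitrary.

import torch

def augment_hmc_image(img: torch.Tensor) -> torch.Tensor:
    """img: (1, H, W) grayscale headset-camera crop with values in [0, 1]."""
    brightness = 1.0 + 0.3 * (torch.rand(1) - 0.5)   # roughly +/-15% brightness
    contrast = 1.0 + 0.4 * (torch.rand(1) - 0.5)     # roughly +/-20% contrast
    img = (img - img.mean()) * contrast + img.mean() * brightness
    # simulate slight headset slippage with a small random pixel shift
    dy, dx = torch.randint(-3, 4, (2,)).tolist()
    img = torch.roll(img, shifts=(dy, dx), dims=(1, 2))
    return img.clamp(0.0, 1.0)

augmented = augment_hmc_image(torch.rand(1, 96, 96))
print(augmented.shape)   # torch.Size([1, 96, 96])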
... Previous studies have reported comparable classification accuracy using similar classifiers and physiological measurements. Hickson et al. (2017) achieved a mean accuracy of 73.7% in classifying facial expressions using measurements from eye-tracking cameras. Chew et al. (2016) In the comparison of classifications with different input features, integrated features of eye-tracking metrics and EEG measurements produce higher mean accuracy than separated features. ...
... Recognition of human emotion from images is an interesting research topic, the results of which can be implemented in facial expression recognition (FER). Currently, the results of research on automatic FER have been used in many applications such as human-computer interaction [1,2]; virtual reality (VR)- [3] and augmented reality (AR)- [4] based games [5,6]; customer marketing and advertising; education [7]; and advanced driver assistant systems (ADASs) [8]. In particular, FER is one of the most important factors of ADASs, because it can be used to detect driver fatigue and, in conjunction with the rapidly developing intelligent vehicle technologies, assist safe driving. ...
Article
Full-text available
In recent years, researchers of deep neural network (DNN)-based facial expression recognition (FER) have reported results showing that these approaches overcome the limitations of conventional machine learning-based FER approaches. However, as DNN-based FER approaches require an excessive amount of memory and incur high processing costs, their application in various fields is very limited and depends on the hardware specifications. In this paper, we propose a fast FER algorithm for monitoring a driver’s emotions that is capable of operating on low-specification devices installed in vehicles. For this purpose, a hierarchical weighted random forest (WRF) classifier, trained on the similarity of sample data in order to improve its accuracy, is employed. In the first step, facial landmarks are detected from input images and geometric features are extracted, considering the spatial position between landmarks. These feature vectors are then fed into the proposed hierarchical WRF classifier to classify facial expressions. Our method was evaluated experimentally using three databases, the extended Cohn-Kanade database (CK+), MMI, and the Keimyung University Facial Expression of Drivers (KMU-FED) database, and its performance was compared with that of state-of-the-art methods. The results show that our proposed method yields performance similar to that of deep learning FER methods, achieving 92.6% for CK+ and 76.7% for MMI, with a processing cost approximately 3731 times lower than that of the DNN method. These results confirm that the proposed method is optimized for real-time embedded applications with limited computing resources.
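As a hedged sketch of the pipeline described above, the code below extracts simple geometric features (pairwise landmark distances) and trains a plain scikit-learn random forest. The paper's hierarchical weighted random forest and its exact feature set are not reproduced here, and the landmark data are synthetic.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def geometric_features(landmarks: np.ndarray) -> np.ndarray:
    """landmarks: (n_points, 2) facial landmark coordinates for one face."""
    diffs = landmarks[:, None, :] - landmarks[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)            # all pairwise distances
    upper = np.triu_indices(len(landmarks), k=1)      # keep each pair once
    return dists[upper]

rng = np.random.default_rng(0)
# toy data: 100 faces x 10 landmarks each, with 3 expression classes
X = np.stack([geometric_features(rng.random((10, 2))) for _ in range(100)])
y = rng.integers(0, 3, size=100)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X[:5]))   # predicted class indices for the first five faces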
... Furthermore, deep learning techniques have been thoroughly applied by the participants of these two challenges (e.g., [229], [230], [231], [232]). Additional related real-world applications, such as the Real-time FER App for smartphones [233], [234], Eyemotion (FER using eye-tracking cameras) [235], privacy-preserving mobile analytics [236], and Unfelt emotions [237], have also been developed. ...
Article
With the transition of facial expression recognition (FER) from laboratory-controlled to challenging in-the-wild conditions and the recent success of deep learning techniques in various fields, deep neural networks have increasingly been leveraged to learn discriminative representations for automatic FER. Recent deep FER systems generally focus on two important issues: overfitting caused by a lack of sufficient training data and expression-unrelated variations, such as illumination, head pose and identity bias. In this paper, we provide a comprehensive survey on deep FER, including datasets and algorithms that provide insights into these intrinsic problems. First, we describe the standard pipeline of a deep FER system with the related background knowledge and suggestions of applicable implementations for each stage. We then introduce the available datasets that are widely used in the literature and provide accepted data selection and evaluation principles for these datasets. For the state of the art in deep FER, we review existing novel deep neural networks and related training strategies that are designed for FER based on both static images and dynamic image sequences, and discuss their advantages and limitations. Competitive performances on widely used benchmarks are also summarized in this section. We then extend our survey to additional related issues and application scenarios. Finally, we review the remaining challenges and corresponding opportunities in this field as well as future directions for the design of robust deep FER systems.
... In this paper, the term FER refers to facial emotion recognition as this study deals with the general aspects of recognition of facial emotion expression.) has also been increasing recently with the rapid development of artificial intelligence techniques, including in human-computer interaction (HCI) [3,4], virtual reality (VR) [5], augmented reality (AR) [6], advanced driver assistant systems (ADASs) [7], and entertainment [8,9]. Although various sensors such as an electromyograph (EMG) or electrocardiogram can be used, the camera is the most promising type of sensor because it provides the most informative clues for FER and does not need to be worn. ...
Article
Full-text available
Facial emotion recognition (FER) is an important topic in the fields of computer vision and artificial intelligence owing to its significant academic and commercial potential. Although FER can be conducted using multiple sensors, this review focuses on studies that exclusively use facial images, because visual expressions are one of the main information channels in interpersonal communication. This paper provides a brief review of research in the field of FER conducted over the past decades. First, conventional FER approaches are described along with a summary of the representative categories of FER systems and their main algorithms. Deep-learning-based FER approaches using deep networks enabling “end-to-end” learning are then presented. This review also focuses on an up-to-date hybrid deep-learning approach combining a convolutional neural network (CNN) for the spatial features of an individual frame and long short-term memory (LSTM) for the temporal features of consecutive frames. In the later part of this paper, a brief review of publicly available evaluation metrics is given, and a comparison with benchmark results, which are a standard for quantitative comparison of FER research, is described. This review can serve as a brief guidebook for newcomers in the field of FER, providing basic knowledge and a general understanding of the latest state-of-the-art studies, as well as for experienced researchers looking for productive directions for future work.
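The hybrid CNN + LSTM design highlighted in this review can be sketched as follows (PyTorch): a small CNN produces per-frame spatial features and an LSTM aggregates them over a clip before classification. Layer sizes, clip length, and the seven-class output are assumptions for illustration only.

import torch
import torch.nn as nn

class CnnLstmFER(nn.Module):
    def __init__(self, num_classes: int = 7, feat_dim: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(                     # per-frame spatial features
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.lstm = nn.LSTM(feat_dim, 32, batch_first=True)  # temporal aggregation
        self.head = nn.Linear(32, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, time, 1, H, W) grayscale face frames
        b, t = clip.shape[:2]
        frame_feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        _, (h_n, _) = self.lstm(frame_feats)
        return self.head(h_n[-1])                     # classify from last hidden state

model = CnnLstmFER()
print(model(torch.rand(2, 8, 1, 48, 48)).shape)       # torch.Size([2, 7])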
Article
Low-cost virtual reality (VR) headsets powered by smartphones are becoming ubiquitous. Their unique position on the user's face opens interesting opportunities for interactive sensing. In this paper, we describe EyeSpyVR, a software-only eye sensing approach for smartphone-based VR, which uses a phone's front-facing camera as a sensor and its display as a passive illuminator. Our proof-of-concept system, using a commodity Apple iPhone, enables four sensing modalities: detecting when the VR headset is worn, detecting blinks, recognizing the wearer's identity, and coarse gaze tracking - features typically found in high-end or specialty VR headsets. We demonstrate the utility and accuracy of EyeSpyVR in a series of studies with 70 participants, finding a worn detection rate of 100%, blink detection rate of 95.3%, family user identification accuracy of 81.4%, and mean gaze tracking error of 10.8° when calibrated to the wearer (12.9° without calibration). These sensing abilities can be used by developers to enable new interactive features and more immersive VR experiences on existing, off-the-shelf hardware.
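EyeSpyVR's actual sensing pipeline is not reproduced here. Purely as an illustration of one of its modalities, the sketch below flags blinks by detecting dips in eye-region brightness relative to a running baseline; the window length and threshold are arbitrary choices.

import numpy as np

def detect_blinks(brightness: np.ndarray, drop_ratio: float = 0.85) -> list:
    """Return frame indices where brightness dips below a running baseline."""
    baseline = np.convolve(brightness, np.ones(15) / 15, mode="same")
    dips = brightness < drop_ratio * baseline
    # report only the first frame of each contiguous dip
    starts = np.flatnonzero(dips & ~np.roll(dips, 1))
    return starts.tolist()

signal = np.full(100, 0.8)
signal[40:45] = 0.5            # simulated blink: brief brightness drop
print(detect_blinks(signal))   # [40]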