Article

A review of multimodal emotion recognition from datasets, preprocessing, features, and fusion methods

... For instance, ML applied to physiological signals and imaging data could make a strong contribution to diagnostic purposes [23,24]. ML can also be of particular importance in the assessment of cognitive effort when applied to physiological data for HMI. In fact, physiological signals, including heart rate (HR), brain activity, skin conductance, and eye movements, as well as facial expressions, provide contemporaneous information on an individual's cognitive and emotional condition [25][26][27]. By using ML algorithms to analyze these signals, it becomes feasible to understand and measure cognitive burden, which refers to the level of mental exertion exhibited by the brain. ...
... The feature learning methods used for the unimodal approach can also be applied to the multimodal one. Features are extracted separately, according to the characteristics of the individual modalities [25]. ...
Article
Full-text available
Background: Human-Machine Interaction (HMI) has been an important field of research in recent years, since machines will continue to be embedded in many human activities in several contexts, such as industry and healthcare. Monitoring, in an ecological manner, the cognitive workload (CW) of users who interact with machines is crucial to assess their level of engagement in activities and the required effort, with the goal of preventing stressful circumstances. This study provides a comprehensive analysis of the assessment of CW using wearable sensors in HMI. Methods: This narrative review explores several techniques and procedures for collecting physiological data through wearable sensors, with the possibility of integrating these multiple physiological signals to provide multimodal monitoring of the individuals' CW. Finally, it focuses on the impact of artificial intelligence methods on the analysis of physiological signal data to provide models of the CW to be exploited in HMI. Results: The review provided a comprehensive evaluation of the wearables, physiological signals, and methods of data analysis for CW evaluation in HMI. Conclusion: The literature highlighted the feasibility of employing wearable sensors to collect physiological signals for ecological CW monitoring in HMI scenarios. However, challenges remain in standardizing these measures across different populations and contexts.
... It is noteworthy that recently, an increasing number of scholars have started to use multimodal information as input for models to enhance the accuracy of emotion recognition: for instance, Pan et al. conducted a review of multimodal emotion recognition methods [13]; Syed Zaidi et al. used a multimodal dual attention transformer for speech emotion recognition [14]; Luna-Jiménez et al. proposed an automatic emotion recognizer model composed of speech and facial emotion recognizers [15], demonstrating the potential of multimodal approaches in emotion recognition and providing new perspectives and directions for future research improvements; Siriwardhana et al. [5] as well as Tripathi et al. [6] employed text, audio, and visual inputs combined with self-supervised learning for emotion recognition; Wang et al. [7] along with Voloshina et al. [8] each proposed multimodal enhancement fusion methods based on the transformer to improve the accuracy of emotion recognition; and Vu et al. combined multimodal technology with scaled data to recognize emotions from internal human signals [16]. ...
Article
Full-text available
This study is dedicated to developing an innovative method for evaluating spoken English by integrating large language models (LLMs) with effective space learning, focusing on the analysis and evaluation of emotional features in spoken language. Addressing the limitation of current spoken English evaluation software that primarily focuses on acoustic features of speech (such as pronunciation, frequency, and prosody) while neglecting emotional expression, this paper proposes a method capable of deeply recognizing and evaluating emotional features in speech. The core of the method comprises three main parts: (1) the creation of a comprehensive spoken English emotion evaluation dataset combining emotionally rich speech data synthesized using LLMs with the IEMOCAP dataset and student spoken audio; (2) an emotion feature encoding network based on transformer architecture, dedicated to extracting effective spatial features from audio; (3) an emotion evaluation network for the spoken English language that accurately identifies emotions expressed by Chinese students by analyzing different audio characteristics. By decoupling emotional features from other sound characteristics in spoken English, this study achieves automated emotional evaluation. This method not only provides Chinese students with the opportunity to improve their ability to express emotions in spoken English but also opens new research directions in the fields of spoken English teaching and emotional expression evaluation.
... In contrast, multimodal emotion recognition, by comprehensively analyzing various modalities like voice, image, and text, can capture and interpret an individual's emotional state more holistically. Therefore, multimodal emotion recognition has become one of the hot topics in current research [3,4]. In the field of multimodal emotion analysis, researchers are extensively dedicated to developing integrated fusion mechanisms to combine data from text, vision, and acoustics, aiming for more precise and comprehensive emotional interpretation. ...
Preprint
Full-text available
With technological advancements, we can now capture rich dialogue content, tones, textual information, and visual data through tools like microphones, the internet, and cameras. However, relying solely on a single modality for emotion analysis often fails to accurately reflect the true emotional state, as this approach overlooks the dynamic correlations between different modalities. To address this, our study introduces a multimodal emotion recognition method that combines tensor decomposition fusion and self-supervised multi-task learning. This method first employs Tucker decomposition techniques to effectively reduce the model’s parameter count, lowering the risk of overfitting. Subsequently, by building a learning mechanism for both multimodal and unimodal tasks and incorporating the concept of label generation, it more accurately captures the emotional differences between modalities. We conducted extensive experiments and analyses on public datasets like CMU-MOSI and CMU-MOSEI, and the results show that our method significantly outperforms existing methods in terms of performance. The related code is open-sourced at https://github.com/ZhuJw31/MMER-TD.
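As an illustration of the low-rank fusion idea described in the abstract above, the sketch below shows how three modality vectors can be fused through per-modality factor matrices instead of a full outer-product tensor, which is what keeps the parameter count small. It is a minimal sketch in the same spirit, not the authors' released code (see their repository linked above); the modality dimensions, rank, and output size are assumptions.

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    """Minimal sketch of low-rank multimodal fusion.

    Instead of forming the full outer product of the three modality
    vectors, each modality is projected through `rank` factor matrices
    and the factors are multiplied elementwise, keeping the number of
    parameters linear in the feature sizes. All sizes are illustrative.
    """

    def __init__(self, dims=(128, 74, 35), rank=4, out_dim=64):
        super().__init__()
        self.factors = nn.ModuleList([nn.Linear(d, rank * out_dim) for d in dims])
        self.rank = rank
        self.out_dim = out_dim

    def forward(self, text, audio, vision):
        fused = None
        for x, proj in zip((text, audio, vision), self.factors):
            f = proj(x).view(-1, self.rank, self.out_dim)
            fused = f if fused is None else fused * f  # elementwise product of factors
        return fused.sum(dim=1)  # collapse the rank dimension

# toy usage with batch size 2 and assumed modality dimensions
t, a, v = torch.randn(2, 128), torch.randn(2, 74), torch.randn(2, 35)
print(LowRankFusion()(t, a, v).shape)  # torch.Size([2, 64])
```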
... Facial expression recognition generally includes three steps: preprocessing [7], feature extraction [8], and expression classification [9]. Traditional methods for feature extraction implemented designs such as local binary patterns [10], histograms of oriented gradients [11], and the scale-invariant feature transform [12]. ...
Article
Full-text available
This paper proposes an improved strategy for the MobileNetV2 neural network (I-MobileNetV2) in response to problems such as the large parameter counts of existing deep convolutional neural networks and the shortcomings of the lightweight MobileNetV2, such as easy loss of feature information, poor real-time performance, and low accuracy in facial emotion recognition tasks. The network inherits MobileNetV2's depthwise separable convolutions, reducing computational load while maintaining a lightweight profile. It utilizes a reverse fusion mechanism to retain negative features, which makes information less likely to be lost. The SELU activation function replaces the ReLU6 activation function to avoid vanishing gradients. Meanwhile, to improve feature recognition capability, a channel attention mechanism (Squeeze-and-Excitation Networks, SE-Net) is integrated into the MobileNetV2 network. Experiments conducted on the facial expression datasets FER2013 and CK+ showed that the proposed network model achieved facial expression recognition accuracies of 68.62% and 95.96%, improving upon the MobileNetV2 model by 0.72% and 6.14% respectively, while the parameter count decreased by 83.8%. These results empirically verify the effectiveness of the improvements made to the network model.
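The channel-attention component mentioned above can be illustrated with a generic Squeeze-and-Excitation block of the kind commonly inserted into MobileNetV2-style networks. This is a minimal sketch, not the I-MobileNetV2 implementation; the reduction ratio and feature-map sizes are assumptions, and SELU appears in the gating branch only to echo the paper's choice of activation.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Generic Squeeze-and-Excitation channel attention (sketch only)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global spatial average
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.SELU(inplace=True),                   # SELU, echoing the paper's activation choice
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # per-channel gates in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # excite: reweight channels

# toy usage on an assumed 24-channel feature map
print(SEBlock(24)(torch.randn(1, 24, 56, 56)).shape)  # torch.Size([1, 24, 56, 56])
```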
Article
Full-text available
Practical demands and academic challenges have both contributed to making sentiment analysis a thriving area of research. Given that a great deal of sentiment analysis work is performed on social media communications, where text frequently ignores the rules of grammar and spelling, pre-processing techniques are required to clean the data. Pre-processing is also required to normalise the text before undertaking the analysis, as social media is inundated with abbreviations, emoticons, emojis, truncated sentences, and slang. While pre-processing has been widely discussed in the literature, and it is considered indispensable, recommendations for best practice have not been conclusive. Thus, we have reviewed the available research on the subject and evaluated various combinations of pre-processing components quantitatively. We have focused on the case of Twitter sentiment analysis, as Twitter has proved to be an important source of publicly accessible data. We have also assessed the effectiveness of different combinations of pre-processing components for the overall accuracy of a couple of off-the-shelf tools and one algorithm implemented by us. Our results confirm that the order of the pre-processing components matters and significantly improves the performance of naïve Bayes classifiers. We also confirm that lemmatisation is useful for enhancing the performance of an index, but it does not notably improve the quality of sentiment analysis.
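The finding that the order of pre-processing components matters can be made concrete with a small pipeline. The steps and their ordering below are illustrative assumptions, not the exact configurations evaluated in the study.

```python
import re

def preprocess(tweet: str) -> list[str]:
    """Illustrative Twitter pre-processing pipeline; step order is an assumption."""
    text = tweet.lower()                            # 1. normalise case
    text = re.sub(r"https?://\S+", " ", text)       # 2. drop URLs
    text = re.sub(r"@\w+", " ", text)               # 3. drop user mentions
    text = re.sub(r"#", "", text)                   # 4. keep hashtag words, drop '#'
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)      # 5. shrink elongated words (soooo -> soo)
    abbreviations = {"u": "you", "gr8": "great"}    # 6. expand a few slang abbreviations
    tokens = [abbreviations.get(tok, tok) for tok in re.findall(r"[a-z0-9']+", text)]
    return tokens

print(preprocess("This is sooooo GR8!!! https://t.co/xyz @user #happy"))
# ['this', 'is', 'soo', 'great', 'happy']
```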
Article
Full-text available
Affective computing, a subcategory of artificial intelligence, detects, processes, interprets, and mimics human emotions. Thanks to the continued advancement of portable non-invasive human sensor technologies, like brain–computer interfaces (BCI), emotion recognition has piqued the interest of academics from a variety of domains. Facial expressions, speech, behavior (gesture/posture), and physiological signals can all be used to identify human emotions. However, the first three may be ineffectual because people may hide their true emotions consciously or unconsciously (so-called social masking). Physiological signals can provide more accurate and objective emotion recognition. Electroencephalogram (EEG) signals respond in real time and are more sensitive to changes in affective states than peripheral neurophysiological signals. Thus, EEG signals can reveal important features of emotional states. Recently, several EEG-based BCI emotion recognition techniques have been developed. In addition, rapid advances in machine and deep learning have enabled machines or computers to understand, recognize, and analyze emotions. This study reviews emotion recognition methods that rely on multi-channel EEG signal-based BCIs and provides an overview of what has been accomplished in this area. It also provides an overview of the datasets and methods used to elicit emotional states. According to the usual emotional recognition pathway, we review various EEG feature extraction, feature selection/reduction, machine learning methods (e.g., k-nearest neighbor, support vector machine, decision tree, artificial neural network, random forest, and naive Bayes) and deep learning methods (e.g., convolutional and recurrent neural networks with long short-term memory). In addition, EEG rhythms that are strongly linked to emotions as well as the relationship between distinct brain areas and emotions are discussed. We also discuss several human emotion recognition studies, published between 2015 and 2021, that use EEG data and compare different machine and deep learning algorithms. Finally, this review suggests several challenges and future research directions in the recognition and classification of human emotional states using EEG.
Conference Paper
Full-text available
In this work we tackle the task of video-based audio-visual emotion recognition, within the premises of the 2nd Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW2). Poor illumination conditions, head/body orientation and low image resolution constitute factors that can potentially hinder performance in case of methodologies that solely rely on the extraction and analysis of facial features. In order to alleviate this problem, we leverage both bodily and contextual features, as part of a broader emotion recognition framework. We choose to use a standard CNN-RNN cascade as the backbone of our proposed model for sequence-to-sequence (seq2seq) learning. Apart from learning through the RGB input modality, we construct an aural stream which operates on sequences of extracted mel-spectrograms. Our extensive experiments on the challenging and newly assembled Aff-Wild2 dataset verify the validity of our intuitive multi-stream and multi-modal approach towards emotion recognition in-the-wild. Emphasis is laid on the beneficial influence of the human body and scene context, as aspects of the emotion recognition process that have been left relatively unexplored up to this point. All the code was implemented using PyTorch and is publicly available.
Article
Full-text available
Multimodal fusion-based emotion recognition has attracted increasing attention in affective computing because different modalities can achieve information complementation. One of the main challenges for reliable and effective model design is to define and extract appropriate emotional features from different modalities. In this paper, we present a novel multimodal emotion recognition framework to estimate categorical emotions, where visual and audio signals are utilized as multimodal input. The model learns neural appearance and key emotion frames using a statistical geometric method, which acts as a pre-processor for saving computation power. Discriminative emotion features expressed from visual and audio modalities are extracted through evolutionary optimization, and then fed to the optimized extreme learning machine (ELM) classifiers for unimodal emotion recognition. Finally, a decision-level fusion strategy is applied to integrate the results of predicted emotions by the different classifiers to enhance the overall performance. The effectiveness of the proposed method is demonstrated through three public datasets, i.e., the acted CK+ dataset, the acted Enterface05 dataset, and the spontaneous BAUM-1s dataset. Average recognition rates of 93.53% on CK+, 91.62% on Enterface05, and 60.77% on BAUM-1s are obtained. The emotion recognition results acquired by fusing visual and audio predicted emotions are superior to both recognition from unimodality and concatenation of individual features.
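Decision-level fusion of the kind described above reduces to a few lines once each unimodal classifier outputs class probabilities. The sketch below uses a simple weighted sum with assumed weights; it is not the evolutionary-optimization pipeline of the paper.

```python
import numpy as np

def decision_level_fusion(p_visual, p_audio, w_visual=0.6, w_audio=0.4):
    """Weighted-sum decision fusion of unimodal class probabilities (sketch).

    p_visual, p_audio: arrays of shape (n_samples, n_classes) with class
    probabilities from the visual and audio classifiers; the weights are
    illustrative and would normally be tuned on validation data.
    """
    fused = w_visual * np.asarray(p_visual) + w_audio * np.asarray(p_audio)
    return fused.argmax(axis=1)

# toy example: two samples, three emotion classes
p_v = np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]])
p_a = np.array([[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]])
print(decision_level_fusion(p_v, p_a))  # [0 2]
```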
Article
Full-text available
Dimensional sentiment analysis aims to recognize continuous numerical values in multiple dimensions such as the valence-arousal (VA) space. Compared to the categorical approach that focuses on sentiment classification such as binary classification (i.e., positive and negative), the dimensional approach can provide a more fine-grained sentiment analysis. This article proposes a tree-structured regional CNN-LSTM model consisting of two parts: regional CNN and LSTM to predict the VA ratings of texts. Unlike a conventional CNN which considers a whole text as input, the proposed regional CNN uses a part of the text as a region, dividing an input text into several regions such that the useful affective information in each region can be extracted and weighted according to their contribution to the VA prediction. Such regional information is sequentially integrated across regions using LSTM for VA prediction. By combining the regional CNN and LSTM, both local (regional) information within sentences and long-distance dependencies across sentences can be considered in the prediction process. To further improve performance, a region division strategy is proposed to discover task-relevant phrases and clauses to incorporate structured information into VA prediction. Experimental results on different corpora show that the proposed method outperforms lexicon-, regression-, conventional NN and other structured NN methods proposed in previous studies.
Article
Full-text available
Nowadays, automatic speech emotion recognition has numerous applications. One of the important steps of these systems is the feature selection step. Because it is not known which acoustic features of a person's speech are related to speech emotion, much effort has been made to introduce several acoustic features. However, since employing all of these features will lower the learning efficiency of classifiers, it is necessary to select some features. Moreover, when there are several speakers, choosing speaker-independent features is required. For this reason, the present paper attempts to select features which are not only related to the emotion of speech, but are also speaker-independent. For this purpose, the current study proposes a multi-task approach which selects the proper speaker-independent features for each pair of classes. The selected features are then given to the classifier. Finally, the outputs of the classifiers are appropriately combined to achieve an output for a multi-class problem. Simulation results reveal that the proposed approach outperforms other methods and offers higher efficiency in terms of detection accuracy and runtime.
Article
Full-text available
Recognition of facial expressions across various actors, contexts, and recording conditions in real-world videos involves identifying local facial movements. Hence, it is important to discover the formation of expressions from local representations captured from different parts of the face. So in this paper, we propose a dynamic kernel-based representation for facial expressions that assimilates facial movements captured using local spatio-temporal representations in a large universal Gaussian mixture model (uGMM). These dynamic kernels are used to preserve local similarities while handling global context changes for the same expression by utilizing the statistics of uGMM. We demonstrate the efficacy of dynamic kernel representation using three different dynamic kernels, namely, explicit mapping based, probability-based, and matching-based, on three standard facial expression datasets, namely, MMI, AFEW, and BP4D. Our evaluations show that probability-based kernels are the most discriminative among the dynamic kernels. However, in terms of computational complexity, intermediate matching kernels are more efficient as compared to the other two representations.
Article
Full-text available
Electroencephalography (EEG) measures the neuronal activities in different brain regions via electrodes. Many existing studies on EEG-based emotion recognition do not fully exploit the topology of EEG channels. In this paper, we propose a regularized graph neural network (RGNN) for EEG-based emotion recognition. RGNN considers the biological topology among different brain regions to capture both local and global relations among different EEG channels. Specifically, we model the inter-channel relations in EEG signals via an adjacency matrix in a graph neural network, where the connection and sparseness of the adjacency matrix are inspired by neuroscience theories of human brain organization. In addition, we propose two regularizers, namely node-wise domain adversarial training (NodeDAT) and emotion-aware distribution learning (EmotionDL), to better handle cross-subject EEG variations and noisy labels, respectively. Extensive experiments on two public datasets, SEED and SEED-IV, demonstrate the superior performance of our model over state-of-the-art models in most experimental settings. Moreover, ablation studies show that the proposed adjacency matrix and two regularizers contribute consistent and significant gains to the performance of our RGNN model. Finally, investigations on the neuronal activities reveal important brain regions and inter-channel relations for EEG-based emotion recognition.
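A stripped-down version of the graph idea above treats EEG channels as graph nodes and lets a learnable adjacency matrix mix information across channels before a shared linear transform. This is a generic sketch rather than the RGNN implementation; the channel count, feature sizes, and row-softmax normalisation are assumptions.

```python
import torch
import torch.nn as nn

class SimpleEEGGraphConv(nn.Module):
    """One graph-convolution layer over EEG channels with a learnable adjacency (sketch)."""

    def __init__(self, n_channels: int = 62, in_feats: int = 5, out_feats: int = 32):
        super().__init__()
        # learnable channel-to-channel connections, initialised near identity
        self.adj = nn.Parameter(torch.eye(n_channels) + 0.01 * torch.randn(n_channels, n_channels))
        self.lin = nn.Linear(in_feats, out_feats)

    def forward(self, x):
        # x: (batch, n_channels, in_feats), e.g. band powers per electrode
        a = torch.softmax(self.adj, dim=-1)        # row-normalise the connections
        mixed = torch.einsum("ij,bjf->bif", a, x)  # propagate features between channels
        return torch.relu(self.lin(mixed))

# toy usage: batch of 4 EEG samples, 62 channels, 5 band-power features each
print(SimpleEEGGraphConv()(torch.randn(4, 62, 5)).shape)  # torch.Size([4, 62, 32])
```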
Conference Paper
Full-text available
Annotating a qualitative large-scale facial expression dataset is extremely difficult due to the uncertainties caused by ambiguous facial expressions, low-quality facial images, and the subjectiveness of annotators. These uncertainties lead to a key challenge of large-scale Facial Expression Recognition (FER) in the deep learning era. To address this problem, this paper proposes a simple yet efficient Self-Cure Network (SCN) which suppresses the uncertainties efficiently and prevents deep networks from over-fitting uncertain facial images. Specifically, SCN suppresses the uncertainty from two different aspects: 1) a self-attention mechanism over the mini-batch to weight each training sample with a ranking regularization, and 2) a careful relabeling mechanism to modify the labels of the samples in the lowest-ranked group. Experiments on synthetic FER datasets and our collected WebEmotion dataset validate the effectiveness of our method. Results on public benchmarks demonstrate that our SCN outperforms current state-of-the-art methods with 88.14% on RAF-DB, 60.23% on AffectNet, and 89.35% on FERPlus.
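The ranking-regularisation idea in SCN can be sketched as follows: each sample in a mini-batch receives an importance weight, the batch is split into a high- and a low-importance group, and a margin is enforced between the two group means. The split ratio and margin below are assumptions, not the paper's settings.

```python
import torch

def rank_regularization_loss(weights: torch.Tensor, ratio: float = 0.7, margin: float = 0.1):
    """Sketch of SCN-style ranking regularisation over per-sample attention weights.

    weights: shape (batch,), importance scores from the self-attention branch.
    The mean of the top `ratio` fraction should exceed the mean of the rest
    by at least `margin`; both hyper-parameters here are assumptions.
    """
    sorted_w, _ = torch.sort(weights, descending=True)
    k = max(1, int(ratio * weights.numel()))
    high_mean = sorted_w[:k].mean()
    low_mean = sorted_w[k:].mean() if k < weights.numel() else sorted_w.new_tensor(0.0)
    return torch.clamp(margin - (high_mean - low_mean), min=0.0)

# toy batch of 8 attention weights
w = torch.tensor([0.9, 0.8, 0.85, 0.7, 0.3, 0.2, 0.75, 0.1])
print(rank_regularization_loss(w))  # tensor(0.) here, since the groups are well separated
```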
Article
Capturing the dynamics of facial expression progression in video is an essential and challenging task for facial expression recognition (FER). In this paper, we propose an effective framework to address this challenge. We develop a C3D-based network architecture, 3D-Inception-ResNet, to extract spatial-temporal features from the dynamic facial expression image sequence. A Spatial-Temporal and Channel Attention Module (STCAM) is proposed to explicitly exploit the holistic spatial-temporal and channel-wise correlations among the extracted features. Specifically, the proposed STCAM calculates a channel-wise and a spatial-temporal-wise attention map to enhance the features along the corresponding feature dimensions for more representative features. We evaluate our method on three popular dynamic facial expression recognition datasets, CK+, Oulu-CASIA and MMI. Experimental results show that our method achieves better or comparable performance compared to the state-of-the-art approaches.
Article
Emotion identification based on multimodal data (e.g., audio, video, text, etc.) is one of the most demanding and important research fields, with various uses. In this context, this research work has conducted a rigorous exploration of model-level fusion to find the optimal multimodal model for emotion recognition using audio and video modalities. More specifically, separate novel feature extractor networks for audio and video data are proposed. After that, an optimal multimodal emotion recognition model is created by fusing audio and video features at the model level. The performances of the proposed models are assessed on two benchmark multimodal datasets, namely the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Surrey Audio-Visual Expressed Emotion (SAVEE) database, using various performance metrics. The proposed models achieve high predictive accuracies of 99% and 86% on the SAVEE and RAVDESS datasets, respectively. The effectiveness of the models is also verified by comparing their performance with existing emotion recognition models. Some case studies are also conducted to explore the models' ability to capture the variability of the emotional states of speakers in publicly available real-world audio-visual media.
Article
Multi-modal speech emotion recognition is a study to predict emotion categories by combining speech data with other types of data, such as video, speech text transcription, body action, or facial expression when speaking, which involves the fusion of multiple features. Most of the early studies, however, directly spliced multi-modal features in the fusion layer after single-modal modeling, ignoring the connections between speech and the other modal features. As a result, we propose a novel multi-modal speech emotion recognition model based on multi-head attention fusion networks, which employs transcribed text and motion capture (MoCap) data involving facial expression, head rotation, and hand action to supplement speech data and perform emotion recognition. For the unimodal stages, we use a two-layer Transformer encoder combination model to extract speech and text features separately, and MoCap is modeled using a deep residual shrinkage network. Simultaneously, we innovate by changing the input of the Transformer encoder to learn the similarities between speech and text, and between speech and MoCap, and then output text and MoCap features that are more similar to the speech features; finally, the emotion category is predicted using the combined features. On the IEMOCAP dataset, our model outperformed earlier research with a recognition accuracy of 75.6%.
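The cross-modal step described above, in which text (or MoCap) features are made more similar to the speech features, can be approximated with a standard multi-head attention layer where text queries attend to the speech sequence. The sketch is generic and the model dimensions are assumptions; it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Sketch: let text features query the speech sequence so the returned
    text representation is aligned with (more similar to) the speech features."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_feats, speech_feats):
        # query = text, key/value = speech: the output mixes in speech information
        aligned, _ = self.attn(text_feats, speech_feats, speech_feats)
        return self.norm(text_feats + aligned)  # residual connection

# toy usage: batch 2, 20 text tokens, 120 speech frames, assumed model size 256
text = torch.randn(2, 20, 256)
speech = torch.randn(2, 120, 256)
print(CrossModalAttention()(text, speech).shape)  # torch.Size([2, 20, 256])
```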
Article
The complete acoustic features include magnitude and phase information. However, traditional speech emotion recognition methods only focus on the magnitude information and ignore the phase data, and will inevitably miss some information. This study explores the accurate extraction and effective use of phase features for speech emotion recognition. First, the reflection of speech emotion in the phase spectrum is analyzed, and a quantitative analysis shows that phase data contain information that can be used to distinguish emotions. A dynamic relative phase (DRP) feature extraction method is then proposed to solve the problem in which the original relative phase (RP) has difficulty determining the base frequency and further alleviating the dependence of the phase on the clipping position of the frame. Finally, a single-channel model (SCM) and a multi-channel model with an attention mechanism (MCMA) are constructed to effectively integrate the phase and magnitude information. By introducing phase information, more complete acoustic features are captured, which enriches the emotional representations. The experiments were conducted using the Emo-DB and IEMOCAP databases. Experimental results demonstrate the effectiveness of the proposed DRP for speech emotion recognition, as well as the complementarity between the phase and magnitude information in speech emotion recognition.
Article
Facial expression recognition (FER) has received significant attention in the past decade with witnessed progress, but data inconsistencies among different FER datasets greatly hinder the generalization ability of the models learned on one dataset to another. Recently, a series of cross-domain FER algorithms (CD-FERs) have been extensively developed to address this issue. Although each declares to achieve superior performance, comprehensive and fair comparisons are lacking due to inconsistent choices of the source/target datasets and feature extractors. In this work, we first propose to construct a unified CD-FER evaluation benchmark, in which we re-implement the well-performing CD-FER and recently published general domain adaptation algorithms and ensure that all these algorithms adopt the same source/target datasets and feature extractors for fair CD-FER evaluations. We find that most of the current state-of-the-art algorithms use adversarial learning mechanisms that aim to learn holistic domain-invariant features to mitigate domain shifts. Therefore, we develop a novel adversarial graph representation adaptation (AGRA) framework that integrates graph representation propagation with adversarial learning to realize effective cross-domain holistic-local feature co-adaptation. We conduct extensive and fair comparisons on the unified evaluation benchmark and show that the proposed AGRA framework outperforms previous state-of-the-art methods.
Article
Facial expression is a powerful, natural, and universal signal for human beings to convey their emotional states and intentions. Numerous studies have been conducted on automatic facial expression analysis because of its practical importance in sociable robotics, medical treatment, driver fatigue surveillance, and many other human-computer interaction systems. Various facial expression recognition (FER) systems have been explored to encode expression information from facial representations in the field of computer vision and machine learning. Traditional methods typically use handcrafted features or shallow learning for FER. However, related studies have collected training samples from challenging real-world scenarios, which implicitly promote the transition of FER from laboratory-controlled to in-the-wild settings since 2013. Meanwhile, studies in various fields have increasingly used deep learning methods, which achieve state-of-the-art recognition accuracy and remarkably exceed the results of previous investigations due to considerably improved chip processing abilities (e.g., GPU units) and appropriately designed network architectures. Moreover, deep learning techniques are increasingly utilized to handle challenging factors for emotion recognition in the wild because of the effective training of facial expression data. The transition of facial expression recognition from being laboratory-controlled to challenging in-the-wild conditions and the recent success of deep learning techniques in various fields have promoted the use of deep neural networks to learn discriminative representations for automatic FER. Recent deep FER systems generally focus on the following important issues. 1) Deep neural networks require a large amount of training data to avoid overfitting. However, existing facial expression databases are insufficient for training common neural networks with deep architecture, which achieve promising results in object recognition tasks. 2) Expression-unrelated variations are common in unconstrained facial expression scenarios, such as illumination, head pose, and identity bias. These disturbances are nonlinearly confounded with facial expressions and therefore strengthen the requirement of deep networks to address the large intraclass variability and learn effective expression-specific representations. We provide a comprehensive review of deep FER, including datasets and algorithms that provide insights into these intrinsic problems, in this survey. First, we introduce the background of fields of FER and summarize the development of available datasets widely used in the literature as well as FER algorithms in the past 10 years. Second, we divide the FER system into two main categories according to feature representations, namely, static image and dynamic sequence FER. The feature representation in static-based methods is encoded with only spatial information from the current single image, whereas dynamic-based methods consider temporal relations among contiguous frames in input facial expression sequences. On the basis of these two vision-based methods, other modalities, such as audio and physiological channels, have also been used in multimodal sentiment analysis systems to assist in FER. Although pure expression recognition based on visible face images can achieve promising results, incorporating it with other models into a high-level framework can provide complementary information and further enhance the robustness. 
We introduce existing novel deep neural networks and related training strategies, which are designed for FER based on both static and dynamic image sequences, and discuss their advantages and limitations in state-of-the-art deep FER. Competitive performance and experimental comparisons of these deep FER systems in widely used benchmarks are also summarized. We then discuss relative advantages and disadvantages of these different types of methods with respect to two open issues (data size requirement and expression-unrelated variations) and other focuses (computation efficiency, performance, and network training difficulty). Finally, we review and summarize the following challenges in this field and future directions for the design of robust deep FER systems. 1) A lack of training data in terms of both quantity and quality is a main challenge in deep FER systems. Abundant sample images with diverse head poses and occlusions as well as precise face attribute labels, including expression, age, gender, and ethnicity, are crucial for practical applications. The crowdsourcing model under the guidance of expert annotators is a reasonable approach for massive annotations. 2) Data bias and inconsistent annotations are very common among different facial expression datasets due to various collecting conditions and the subjectiveness of annotating. Furthermore, the FER performance fails to improve when training data is enlarged by directly merging multiple datasets due to inconsistent expression annotations. Cross-database performance is an important evaluation criterion of generalizability and practicability of FER systems. Deep domain adaptation and knowledge distillation are promising trends to address this bias. 3) Another common issue is imbalanced class distribution in facial expression due to the practicality of sample acquirement. One solution is to resample and balance the class distribution on the basis of the number of samples for each class during the preprocessing stage using data augmentation and synthesis. Another alternative is to develop a cost-sensitive loss layer for reweighting during network training. 4) Although FER within the categorical model has been extensively investigated, the definition of prototypical expressions covers only a small portion of specific categories and cannot capture the full repertoire of expressive behavior for realistic interactions. Incorporating other affective models, such as FACS (the Facial Action Coding System) and dimensional models, can facilitate the recognition of facial expressions and allow them to learn expression-discriminative representations. 5) Human expressive behavior in realistic applications involves encoding from different perspectives, with facial expressions as only one modality. Although pure expression recognition based on visible face images can achieve promising results, incorporating it with other models into a high-level framework can provide complementary information and further enhance the robustness. For example, the fusion of other modalities, such as audio information, infrared images, and depth information from 3D face models and physiological data, has become a promising research direction due to the large complementarity with facial expressions and the good application value for human-computer interaction (HCI) applications.
Article
This survey paper reviews the finest machine learning architectures, the algorithms used, and their applications in speech and vision systems. Current technology opens vast areas of research into the architectures and algorithms used in machine learning, which can further support the planning of new ideas and the intelligent redesign of speech and vision systems. Personal computing and its commercialization are at an all-time high. Machine learning can be used for learning and training on large sets of sensor data and cloud computing, not to mention the increasingly sophisticated mobile and embedded technology. The survey provides a detailed background on the evolving nature of the most widely used deep learning models that are applied effectively in vision and speech systems. It aims to give a perspective on large-scale industrial research and highlights future directions and upcoming demands for the intelligent use of speech and vision processes. A strong, robust, and highly intelligent system demands low latency and high fidelity, which is most evident in hardware devices with limited resources such as automobiles, robots, and mobile phones. With these points in mind, the survey details the major challenges and success rates of machine learning on platforms with limited resources, where the restrictions concern memory, battery life, and processing capacity. The conclusion addresses fast-emerging applications based on the use of speech and vision systems, such as effective evaluation, smart and fast transportation, and correct medicine prescription. The paper promises a comprehensive survey emphasizing the demands of speech and vision systems from the view of both hardware and software systems. The machine learning technologies discussed are rapidly gaining adoption and aim to revolutionize research and development in speech and vision systems.
Article
Link for paper: https://www.techrxiv.org/articles/preprint/Survey_of_Deep_Representation_Learning_for_Speech_Emotion_Recognition/16689484 Traditionally, speech emotion recognition (SER) research has relied on manually handcrafted acoustic features using feature engineering. However, the design of handcrafted features for complex SER tasks requires significant manual effort, which impedes generalisability and slows the pace of innovation. This has motivated the adoption of representation learning techniques that can automatically learn an intermediate representation of the input signal without any manual feature engineering. Representation learning has led to improved SER performance and enabled rapid innovation. Its effectiveness has further increased with advances in deep learning (DL), which has facilitated deep representation learning, where hierarchical representations are automatically learned in a data-driven manner. This paper presents the first comprehensive survey on the important topic of deep representation learning for SER. We highlight various techniques and related challenges and identify important future areas of research. Our survey bridges the gap in the literature, since existing surveys either focus on SER with hand-engineered features or on representation learning in the general setting without focusing on SER.
Article
Convolutional Neural Networks (CNNs) have achieved remarkable performance breakthroughs in a variety of tasks. Recently, CNN-based methods that are fed with hand-extracted EEG features have steadily improved their performance on the emotion recognition task. In this paper, we propose a novel convolutional layer, called the Scaling Layer, which can adaptively extract effective data-driven spectrogram-like features from raw EEG signals. Furthermore, it exploits convolutional kernels scaled from one data-driven pattern to expose a frequency-like dimension, addressing the shortcomings of prior methods that require hand-extracted features or their approximations. ScalingNet, the proposed neural network architecture based on the Scaling Layer, has achieved state-of-the-art results across the established DEAP and AMIGOS benchmark datasets.
Article
There has been encouraging progress in affective states recognition models based on single-modality signals, such as electroencephalogram (EEG) signals or peripheral physiological signals, in recent years. However, multimodal physiological signals-based affective states recognition methods have not been thoroughly exploited yet. Here we propose Multiscale Convolutional Neural Networks (Multiscale CNNs) and a biologically inspired decision fusion model for multimodal affective states recognition. Firstly, the raw signals are pre-processed with baseline signals. Then, the High Scale CNN and Low Scale CNN in the Multiscale CNNs are utilized to predict the probability of affective states output for EEG and each peripheral physiological signal, respectively. Finally, the fusion model calculates the reliability of each single-modality signal by the Euclidean distance between the various class labels and the classification probability from the Multiscale CNNs, and the decision is made by the more reliable modality information while the other modalities' information is retained. We use this model to classify four affective states from the arousal-valence plane in the DEAP and AMIGOS datasets. The results show that the fusion model improves the accuracy of affective states recognition significantly compared with the results on single-modality signals, and the recognition accuracy of the fusion result achieves 98.52% and 99.89% on the DEAP and AMIGOS datasets, respectively.
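The reliability-based decision rule described above can be sketched by measuring, for each modality, how close its probability vector lies to the nearest one-hot class vector and then trusting the closer one. This is a simplified reading of the fusion rule with toy numbers, not the authors' code.

```python
import numpy as np

def reliability_fusion(prob_eeg, prob_peripheral):
    """Pick, per sample, the modality whose class probabilities lie closest
    to a one-hot label vector (smaller Euclidean distance = more reliable).
    Sketch of a reliability-based decision fusion rule, not the exact model."""
    def distance_to_label(p):
        one_hot = np.eye(p.shape[1])[p.argmax(axis=1)]   # nearest class label vector
        return np.linalg.norm(p - one_hot, axis=1)       # distance to that label

    d_eeg, d_per = distance_to_label(prob_eeg), distance_to_label(prob_peripheral)
    use_eeg = d_eeg <= d_per                             # the more reliable modality wins
    return np.where(use_eeg, prob_eeg.argmax(axis=1), prob_peripheral.argmax(axis=1))

# toy example: two samples, four affective states
p_eeg = np.array([[0.9, 0.05, 0.03, 0.02], [0.4, 0.3, 0.2, 0.1]])
p_per = np.array([[0.4, 0.3, 0.2, 0.1], [0.05, 0.9, 0.03, 0.02]])
print(reliability_fusion(p_eeg, p_per))  # [0 1]
```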
Article
Emotion recognition from electroencephalograph (EEG) signals has long been essential for affective computing. In this paper, we evaluate the EEG emotion recognition by converting EEG signals of the multiple channels into images such that richer spatial information can be considered and the question of EEG-based emotion recognition can be converted into image recognition. To this end, we propose a novel method to generate the continuous images from the discrete EEG signals by introducing offset variables following Gaussian distribution for each EEG channel to alleviate the imprecise electrode coordinates during the image generation. In addition, a novel graph embedded convolutional neural network (GECNN) method is proposed to combine the local convolutional neural network (CNN) features with global functional features, so as to provide the complementary emotion information. In GECNN, the attention mechanism is applied to extract more discriminative local features. Simultaneously, the dynamical graph filtering explores the intrinsic relationships between different EEG regions. The local and global functional features are finally fused for emotion recognition. Extensive experiments in subject-dependent and subject-independent protocols are conducted to evaluate the performance of the proposed GECNN model on SEED, SDEA, DREAMER and MPED datasets.
Article
Emotion recognition in conversation (ERC) is an important research topic in artificial intelligence. Different from the emotion estimation in individual utterances, ERC requires proper handling of human interactions in conversations. Several approaches have been proposed for ERC and achieved promising results. In this paper, we propose a correction model for previous approaches, called “Dialogical Emotion Correction Network (DECN)”. This model aims to automatically correct some errors made by emotion recognition strategies and further improve the recognition performance. Specifically, DECN employs a graphical network to model human interactions in conversations. To further utilize the contextual information, DECN also employs the multi-head attention based bi-directional GRU component. Since DECN is a correction model for ERC, it can be easily integrated with any emotion recognition strategy. Experimental results on the IEMOCAP and MELD datasets verify the effectiveness of our proposed method. DECN can improve the performance of emotion recognition strategies with few parameters and low computational complexity.
Article
Neuroscience research has shown that the left and right hemispheres of the human brain respond differently to the same or different emotions. Exploiting this difference in the human brain response is important for emotion recognition. In this study, we propose a bi-hemisphere discrepancy convolutional neural network model (BiDCNN) for electroencephalograph (EEG) emotion recognition, which can effectively learn the different response patterns between the left and right hemispheres, and is designed as a three-input and single-output network structure with three convolutional neural network layers. Specifically, to capture and amplify the different electrical responses of the left and right brain to emotional stimuli, three different EEG feature matrices are constructed based on the International 10-20 System. Subsequently, by using three convolutional neural network layers, spatial and temporal features are extracted to mine the inter-channel correlations among physically adjacent EEG electrodes. We evaluate our proposed BiDCNN model on the classical DEAP dataset to verify its effectiveness. Our results for the subject-dependent experiment, in which the data for training and testing come from the same subject, show that BiDCNN achieves state-of-the-art performance with a mean accuracy of 94.38% in valence and 94.72% in arousal. Furthermore, our subject-independent experimental results, in which different subjects are used to train and test the model, show that BiDCNN also obtains superior results on the valence and arousal recognition tasks, achieving accuracies of 68.14% and 63.94%, respectively.
Article
We present a learning model for multimodal context-aware emotion recognition. Our approach combines multiple co-occurring human modalities (such as facial, audio, textual, and pose/gait) and two interpretations of context. To gather and encode background semantic information from the input image/video for the first context interpretation, we use a self-attention-based CNN. Similarly, for modeling the socio-dynamic interactions among people (the second context interpretation) in the input image/video, we use depth maps. We use multiplicative fusion to combine the modality and context channels, which learns to focus on the more informative input channels and suppress the others for every incoming datapoint. We demonstrate the efficiency of our model on four benchmark emotion recognition datasets (IEMOCAP, CMU-MOSEI, EMOTIC, and GroupWalk). Our model outperforms SOTA learning methods by an average 5-9% increase across all the datasets. We also perform ablation studies to motivate the importance of multi-modality, context, and multiplicative fusion.
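Multiplicative fusion as described above learns to emphasise informative channels and suppress the rest. The sketch below uses a generic softmax gate over modality/context channels to convey the idea; it is not the paper's exact multiplicative-fusion formulation, and the channel count, feature size, and class count are assumptions.

```python
import torch
import torch.nn as nn

class GatedModalityFusion(nn.Module):
    """Sketch of a fusion layer that learns, per input sample, how much to
    trust each modality/context channel and suppresses the rest.
    A generic gating formulation, not the paper's exact multiplicative fusion."""

    def __init__(self, n_channels: int = 4, feat_dim: int = 128, n_classes: int = 6):
        super().__init__()
        self.gate = nn.Linear(n_channels * feat_dim, n_channels)  # one score per channel
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, channel_feats):
        # channel_feats: (batch, n_channels, feat_dim), e.g. face/audio/text/pose streams
        b, c, d = channel_feats.shape
        scores = torch.softmax(self.gate(channel_feats.reshape(b, c * d)), dim=-1)
        fused = (scores.unsqueeze(-1) * channel_feats).sum(dim=1)  # weighted sum of channels
        return self.classifier(fused)

# toy usage: batch of 3 samples, 4 input channels, assumed 128-d features
print(GatedModalityFusion()(torch.randn(3, 4, 128)).shape)  # torch.Size([3, 6])
```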
Article
Unlike the conventional facial expressions, micro-expressions are involuntary and transient facial expressions capable of revealing the genuine emotions that people attempt to hide. Therefore, they can provide important information in a broad range of applications such as lie detection, criminal detection, etc. Since micro-expressions are transient and of low intensity, however, their detection and recognition is difficult and relies heavily on expert experiences. Due to its intrinsic particularity and complexity, video-based micro-expression analysis is attractive but challenging, and has recently become an active area of research. Although there have been numerous developments in this area, thus far there has been no comprehensive survey that provides researchers with a systematic overview of these developments with a unified evaluation. Accordingly, in this survey paper, we first highlight the key differences between macro- and micro-expressions, then use these differences to guide our research survey of video-based micro-expression analysis in a cascaded structure, encompassing the neuropsychological basis, datasets, features, spotting algorithms, recognition algorithms, applications and evaluation of state-of-the-art approaches. For each aspect, the basic techniques, advanced developments and major challenges are addressed and discussed. Furthermore, after considering the limitations of existing micro-expression datasets, we present and release a new dataset, called micro-and-macro expression warehouse (MMEW), containing more video samples and more labeled emotion types. We then perform a unified comparison of representative methods on CAS(ME)² for spotting, and on MMEW and SAMM for recognition, respectively. Finally, some potential future research directions are explored and outlined.
Article
Speech emotion recognition is an important but difficult task in human-computer interaction systems. One of the main challenges in speech emotion recognition is how to extract effective emotion features from a long utterance. To address this issue, we propose a novel spatiotemporal and frequential cascaded attention network with large-margin learning in this paper. Spatiotemporal attention selectively locates the targeted emotional regions from a long speech spectrogram. In these targeted regions, frequential attention captures the emotional features by frequency distribution. The cascaded attention assists the neural network to gradually extract effective emotion features from the long spectrogram. During training, large-margin learning is applied to improve intra-class compactness and enlarge inter-class distances. Experiments on four public datasets demonstrate that our proposed model achieves a promising performance in speech emotion recognition.
Article
Existing methods for electroencephalograph (EEG) emotion recognition always train models on all EEG samples indiscriminately. However, some of the source (training) samples may exert a negative influence because they are significantly dissimilar from the target (test) samples. It is therefore necessary to give more attention to EEG samples with strong transferability rather than forcefully training a classification model on all samples. Furthermore, from a neuroscience perspective, not all brain regions of an EEG sample contain emotional information that can be transferred effectively to the test data; some brain-region data may even strongly hinder learning of the emotion classification model. Considering these two issues, in this paper we propose a transferable attention neural network (TANN) for EEG emotion recognition, which learns emotionally discriminative information by adaptively highlighting transferable EEG brain-region data and samples through local and global attention mechanisms. This is implemented by measuring the outputs of multiple brain-region-level discriminators and a single sample-level discriminator. Extensive experiments on EEG emotion recognition demonstrate that the proposed TANN is superior to state-of-the-art methods.
Article
Long short-term memory (LSTM) neural networks and attention mechanism have been widely used in sentiment representation learning and detection of texts. However, most of the existing deep learning models for text sentiment analysis ignore emotion's modulation effect on sentiment feature extraction, and the attention mechanisms of these deep neural network architectures are based on word- or sentence-level abstractions. Ignoring higher level abstractions may pose a negative effect on learning text sentiment features and further degrade sentiment classification performance. To address this issue, in this article, a novel model named AEC-LSTM is proposed for text sentiment detection, which aims to improve the LSTM network by integrating emotional intelligence (EI) and attention mechanism. Specifically, an emotion-enhanced LSTM, named ELSTM, is first devised by utilizing EI to improve the feature learning ability of LSTM networks, which accomplishes its emotion modulation of learning system via the proposed emotion modulator and emotion estimator. In order to better capture various structure patterns in text sequence, ELSTM is further integrated with other operations, including convolution, pooling, and concatenation. Then, topic-level attention mechanism is proposed to adaptively adjust the weight of text hidden representation. With the introduction of EI and attention mechanism, sentiment representation and classification can be more effectively achieved by utilizing sentiment semantic information hidden in text topic and context. Experiments on real-world data sets show that our approach can improve sentiment classification performance effectively and outperform state-of-the-art deep learning-based methods significantly.
Article
As an important branch of affective computing, Speech Emotion Recognition (SER) plays a vital role in human-computer interaction. In order to mine the relevance of signals in audio and increase the diversity of information, Bi-directional Long-Short Term Memory with Directional Self-Attention (BLSTM-DSA) is proposed in this paper. Long Short-Term Memory (LSTM) can learn long-term dependencies from learned local features. Moreover, Bi-directional Long-Short Term Memory (BLSTM) can make the structure more robust through its direction mechanism, because directional analysis can better recognize the hidden emotions in a sentence. At the same time, the autocorrelation of speech frames can be used to deal with the lack of information, so a Self-Attention mechanism is introduced into SER. The attention weight of each frame is calculated with the outputs of the forward and backward LSTM separately, rather than after adding them together. Thus, the algorithm can automatically annotate the weights of speech frames to correctly select frames with emotional information in the temporal network. When evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database and the Berlin Database of Emotional Speech (EMO-DB), BLSTM-DSA demonstrates satisfactory performance on the task of speech emotion recognition. In particular, for recognizing happiness and anger, BLSTM-DSA achieves the highest recognition accuracies.
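The directional-attention idea above, i.e. computing separate frame-level attention weights for the forward and backward LSTM outputs before merging the two directions, can be sketched as follows. Hidden sizes, input features, and the linear scoring function are assumptions, not the BLSTM-DSA configuration.

```python
import torch
import torch.nn as nn

class DirectionalAttentionBLSTM(nn.Module):
    """Sketch of a BLSTM whose forward and backward outputs get their own
    frame-level attention weights before being combined (BLSTM-DSA-like).
    Sizes and the linear scoring functions are illustrative assumptions."""

    def __init__(self, in_dim: int = 40, hidden: int = 128, n_classes: int = 4):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.score_fwd = nn.Linear(hidden, 1)
        self.score_bwd = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, frames):
        out, _ = self.blstm(frames)                      # (batch, time, 2*hidden)
        fwd, bwd = out.chunk(2, dim=-1)                  # split the two directions
        a_f = torch.softmax(self.score_fwd(fwd), dim=1)  # per-frame weights, forward pass
        a_b = torch.softmax(self.score_bwd(bwd), dim=1)  # per-frame weights, backward pass
        pooled = torch.cat([(a_f * fwd).sum(1), (a_b * bwd).sum(1)], dim=-1)
        return self.classifier(pooled)

# toy usage: batch of 2 utterances, 200 frames, 40 acoustic features per frame
print(DirectionalAttentionBLSTM()(torch.randn(2, 200, 40)).shape)  # torch.Size([2, 4])
```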
Article
Automated emotion recognition in the wild from facial images remains a challenging problem. Although recent advances in deep learning have brought a significant breakthrough to this topic, strong changes in pose, orientation, and point of view severely harm current approaches. In addition, the acquisition of labeled datasets is costly, and the current state-of-the-art deep learning algorithms cannot model all the aforementioned difficulties. In this article, we propose applying a multitask learning loss function to share a common feature representation with other related tasks. Particularly, we show that emotion recognition benefits from jointly learning a model with a detector of facial action units (collective muscle movements). The proposed loss function addresses the problem of learning multiple tasks with heterogeneously labeled data, improving previous multitask approaches. We validate the proposal using three datasets acquired in noncontrolled environments, and an application to predict compound facial emotion expressions.
Article
Multimodal emotion recognition is a challenging task because of the different modalities through which emotion is expressed over time in video clips. Considering the spatial-temporal correlations present in video, we propose an audio-visual fusion model of deep learning features with a Mixture of Brain Emotional Learning (MoBEL) model inspired by the brain's limbic system. The proposed model has two stages. First, deep learning methods, specifically a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN), are applied to represent highly abstract features. Second, the fusion model, namely MoBEL, is designed to jointly learn from the combined audio-visual features. For the visual modality, a 3D-CNN is used to learn the spatial-temporal features of facial expression. For the auditory modality, Mel-spectrograms of the speech signals are fed into a CNN-RNN for spatial-temporal feature extraction. The high-level feature fusion approach with the MoBEL network exploits the correlation between the visual and auditory modalities to improve emotion recognition performance. Experimental results on the eNTERFACE'05 database demonstrate that the proposed method outperforms hand-crafted features and other state-of-the-art information fusion models in video emotion recognition.
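The two-branch structure (3D-CNN for video, CNN-RNN for mel-spectrograms, then feature-level fusion) can be sketched as below. A plain MLP stands in for the MoBEL fusion module, and all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    """Sketch of high-level audio-visual feature fusion (MLP in place of MoBEL)."""
    def __init__(self, num_emotions=6):
        super().__init__()
        # Visual branch: tiny 3D-CNN over (batch, channels, frames, H, W) clips.
        self.visual = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())
        # Audio branch: 2D-CNN over mel-spectrograms, then a GRU over time.
        self.audio_cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)))              # pool frequency, keep time axis
        self.audio_rnn = nn.GRU(16, 32, batch_first=True)
        self.fusion = nn.Sequential(nn.Linear(16 + 32, 64), nn.ReLU(),
                                    nn.Linear(64, num_emotions))

    def forward(self, clip, mel):                         # clip: (B,3,T,H,W), mel: (B,1,F,T)
        v = self.visual(clip)                             # (B, 16)
        a = self.audio_cnn(mel).squeeze(2).transpose(1, 2)  # (B, T, 16)
        _, h = self.audio_rnn(a)                          # h: (1, B, 32)
        return self.fusion(torch.cat([v, h.squeeze(0)], dim=-1))

logits = AudioVisualFusion()(torch.randn(2, 3, 8, 32, 32), torch.randn(2, 1, 64, 100))
```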
Article
This paper proposes a new deep learning architecture for context-based multi-label, multi-task emotion recognition. The architecture is built from three main modules: (1) a body feature extraction module, a pre-trained Xception network; (2) a scene feature extraction module, based on a modified VGG16 network; and (3) a fusion-decision module. Moreover, three categorical and three continuous loss functions are compared in order to highlight the importance of the synergy between loss functions in multi-task learning. We then propose a new loss function, the multi-label focal loss (MFL), based on the focal loss, to deal with imbalanced data. Experimental results on the EMOTIC dataset show that MFL combined with the Huber loss gives better results than any other combination and outperforms the current state of the art on the less frequent labels.
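For reference, a focal loss applied independently to each label of a multi-label target looks like the sketch below. This follows the standard binary focal-loss formulation; the paper's exact MFL may differ in its weighting details.

```python
import torch
import torch.nn.functional as F

def multilabel_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss summed over the labels of a multi-label target (sketch)."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # probability of the true label
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()      # down-weights easy examples

loss = multilabel_focal_loss(torch.randn(4, 26), torch.randint(0, 2, (4, 26)).float())
```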
Article
Composite-database micro-expression recognition is attracting increasing attention because it is more practical for real-world applications. Although a composite database provides more sample diversity for learning good representation models, the important subtle dynamics are prone to disappearing under domain shift, so that models, especially deep ones, degrade greatly in performance. In this paper, we analyze the influence of learning complexity, including input complexity and model complexity, and find that lower-resolution input data and a shallower-architecture model help ease the degradation of deep models on the composite-database task. Based on this, we propose a recurrent convolutional network (RCN) that exploits the shallower architecture and lower-resolution input data, shrinking model and input complexities simultaneously. Furthermore, we develop three parameter-free modules (i.e., wide expansion, shortcut connection, and attention unit) that integrate with RCN without adding any learnable parameters. These three modules enhance representation ability from various perspectives while preserving a not-very-deep architecture for lower-resolution data. The three modules can further be combined by an automatic strategy (a neural architecture search strategy), and the searched architecture becomes more robust. Extensive experiments on the MEGC2019 dataset (composed of the existing SMIC, CASME II, and SAMM datasets) verify the influence of learning complexity and show that RCNs with the three modules and the searched combination outperform state-of-the-art approaches.
Article
Most previous EEG-based emotion recognition methods studied hand-crafted EEG features extracted from different electrodes. In this article, we study the relation among different EEG electrodes and propose a deep learning method to automatically extract the spatial features that characterize the functional relation between EEG signals at different electrodes. Our proposed deep model is called Attention-based LSTM with Domain Discriminator (ATDD-LSTM), a model based on Long Short-Term Memory (LSTM) for emotion recognition that can characterize nonlinear relations among EEG signals of different electrodes. To achieve state-of-the-art emotion recognition performance, the architecture of ATDD-LSTM has two distinguishing characteristics: (1) By applying the attention mechanism to the feature vectors produced by the LSTM, ATDD-LSTM automatically selects suitable EEG channels for emotion recognition, which makes the learned model concentrate on the emotion-related channels in response to a given emotion; (2) To minimize the significant feature distribution shift between different sessions and/or subjects, ATDD-LSTM uses a domain discriminator to modify the data representation space and generate domain-invariant features. We evaluate the proposed ATDD-LSTM model on three public EEG emotional databases (DEAP, SEED and CMEED) for emotion recognition. The experimental results demonstrate that our ATDD-LSTM model achieves superior performance on subject-dependent (for the same subject), subject-independent (for different subjects) and cross-session (for the same subject) evaluation.
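Domain-adversarial training of this kind is commonly implemented with a gradient-reversal layer between the shared features and the domain discriminator. The sketch below illustrates that pattern (LSTM encoder, attention pooling, emotion head, domain head); dimensions, the number of domains, and the lambda schedule are assumptions rather than the authors' settings.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, reversed (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class DomainAdversarialEmotionNet(nn.Module):
    """Sketch: LSTM + attention encoder with an emotion head and an adversarial domain head."""
    def __init__(self, n_channels=32, hidden=64, n_emotions=3, n_domains=15):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)
        self.emotion_head = nn.Linear(hidden, n_emotions)
        self.domain_head = nn.Linear(hidden, n_domains)

    def forward(self, eeg, lam=1.0):                    # eeg: (batch, time, n_channels)
        h, _ = self.lstm(eeg)
        w = torch.softmax(self.attn(h).squeeze(-1), dim=1)
        feat = (w.unsqueeze(-1) * h).sum(dim=1)         # attention-pooled feature
        return self.emotion_head(feat), self.domain_head(GradReverse.apply(feat, lam))

emo_logits, dom_logits = DomainAdversarialEmotionNet()(torch.randn(2, 200, 32))
```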
Article
Emotions are conscious mental reactions toward various situations; such responses stem from physiological, cognitive, and behavioral changes. Electroencephalogram (EEG) signals provide a noninvasive and nonradioactive means of emotion identification. Accurate and automatic classification of emotions can boost the development of human-computer interfaces. This article proposes automatic extraction and classification of features through different convolutional neural networks (CNNs). First, the proposed method converts the filtered EEG signals into images using a time-frequency representation: the smoothed pseudo-Wigner-Ville distribution is used to transform time-domain EEG signals into images. These images are fed to pretrained AlexNet, ResNet50, and VGG16 models as well as a configurable CNN. The performance of the four CNNs is evaluated in terms of accuracy, precision, Matthews correlation coefficient, F1-score, and false-positive rate. The evaluation shows that the configurable CNN requires far fewer learnable parameters while achieving better accuracy. Accuracy scores of 90.98%, 91.91%, 92.71%, and 93.01% obtained by AlexNet, ResNet50, VGG16, and the configurable CNN, respectively, show that the proposed method outperforms other existing methods.
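The overall pipeline (signal, time-frequency image, pretrained image CNN) can be sketched as follows. A plain spectrogram is used here as a stand-in for the smoothed pseudo-Wigner-Ville distribution, pretrained weights are not loaded, and all parameters are illustrative assumptions.

```python
import numpy as np
import torch
from scipy.signal import spectrogram
from torchvision.models import alexnet

def eeg_to_tf_image(sig, fs=128, size=224):
    """Turn a 1-D EEG segment into a 3-channel time-frequency image.
    A log spectrogram stands in for the paper's smoothed pseudo-Wigner-Ville distribution."""
    _, _, S = spectrogram(sig, fs=fs, nperseg=64, noverlap=48)
    S = np.log1p(S)
    S = (S - S.min()) / (S.max() - S.min() + 1e-8)                  # normalize to [0, 1]
    img = torch.tensor(S, dtype=torch.float32)[None, None]          # (1, 1, F, T)
    img = torch.nn.functional.interpolate(img, size=(size, size))   # resize to 224x224
    return img.repeat(1, 3, 1, 1)                                    # replicate as RGB

model = alexnet(weights=None)            # pretrained ImageNet weights would normally be loaded
model.classifier[-1] = torch.nn.Linear(model.classifier[-1].in_features, 3)  # 3 emotion classes
logits = model(eeg_to_tf_image(np.random.randn(8 * 128)))
```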
Article
In recent years, deep learning (DL) techniques, and in particular convolutional neural networks (CNNs), have shown great potential in electroencephalograph (EEG)-based emotion recognition. However, existing CNN-based EEG emotion recognition methods usually require a relatively complex feature pre-extraction stage. More importantly, CNNs cannot well characterize the intrinsic relationship among the different channels of EEG signals, which is a crucial clue for recognizing emotion. In this paper, we propose an effective multi-level features guided capsule network (MLF-CapsNet) for multi-channel EEG-based emotion recognition to overcome these issues. MLF-CapsNet is an end-to-end framework that simultaneously extracts features from the raw EEG signals and determines the emotional state. Compared with the original CapsNet, it incorporates multi-level feature maps learned by different layers when forming the primary capsules, so that the capability of feature representation is enhanced. In addition, it uses a bottleneck layer to reduce the number of parameters and accelerate computation. Our method achieves average accuracies of 97.97%, 98.31% and 98.32% on valence, arousal and dominance of the DEAP dataset, respectively, and 94.59%, 95.26% and 95.13% on valence, arousal and dominance of the DREAMER dataset, respectively. These results show that our method achieves higher accuracy than state-of-the-art methods.
Article
Multimodal human sentiment comprehension refers to recognizing human affect from multiple modalities. There are two key issues in this problem. First, it is difficult to explore time-dependent interactions between modalities and focus on the important time steps. Second, processing the long fused sequence of utterances is susceptible to the forgetting problem due to long-term temporal dependency. In this paper, we introduce a hierarchical learning architecture to classify utterance-level sentiment. To address the first issue, we perform time-step level fusion to generate fused features for each time step, which explicitly models time-restricted interactions by incorporating information across modalities at the same time step. Furthermore, based on the assumption that acoustic features directly reflect emotional intensity, we pioneer emotion intensity attention to focus on the time steps where emotion changes or intense affect takes place. To handle the second issue, we propose the Residual Memory Network (RMN) to process the fused sequence. RMN uses techniques such as directly passing the previous state into the next time step, which helps retain information from many time steps earlier. We show that our method achieves state-of-the-art performance on multiple datasets. Results also suggest that RMN yields competitive performance on sequence modeling tasks.
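Time-step level fusion with an acoustic-only attention scorer can be sketched as below: modalities are concatenated per time step, and the attention weight for each step is computed from the audio features alone. The RMN component is omitted, and feature dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TimeStepFusionWithIntensityAttention(nn.Module):
    """Sketch: per-time-step fusion weighted by acoustic "emotion intensity" attention."""
    def __init__(self, d_audio=40, d_visual=64, d_text=128, hidden=128, n_classes=3):
        super().__init__()
        self.fuse = nn.Linear(d_audio + d_visual + d_text, hidden)
        self.intensity = nn.Linear(d_audio, 1)        # attention scored from audio only
        self.classify = nn.Linear(hidden, n_classes)

    def forward(self, audio, visual, text):            # each: (batch, time, d_*)
        fused = torch.tanh(self.fuse(torch.cat([audio, visual, text], dim=-1)))
        w = torch.softmax(self.intensity(audio).squeeze(-1), dim=1)   # (batch, time)
        pooled = (w.unsqueeze(-1) * fused).sum(dim=1)                 # intensity-weighted pooling
        return self.classify(pooled)

out = TimeStepFusionWithIntensityAttention()(
    torch.randn(2, 20, 40), torch.randn(2, 20, 64), torch.randn(2, 20, 128))
```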
Article
Recent deep neural network-based methods have achieved state-of-the-art performance on various facial expression recognition tasks. Despite such progress, previous research on facial expression recognition has mainly focused on analyzing color video recordings only. However, the complex emotions expressed through dynamic facial expressions by people with different skin colors under different lighting conditions can only be fully understood by integrating information from multimodal videos. We present a novel method to estimate dimensional emotion states, where color, depth, and thermal recording videos are used as multi-modal input. Our networks, called multimodal recurrent attention networks (MRAN), learn spatiotemporal attention volumes to robustly recognize facial expressions based on attention-boosted feature volumes. We leverage the depth and thermal sequences as guidance priors for the color sequence to selectively focus on emotionally discriminative regions. We also introduce a novel benchmark for multi-modal facial expression recognition, termed multi-modal arousal-valence facial expression recognition (MAVFER), which consists of color, depth, and thermal recording videos with corresponding continuous arousal-valence scores. The experimental results show that our method achieves state-of-the-art results in dimensional facial expression recognition on color recording datasets including RECOLA, SEWA and AFEW, and on a multimodal recording dataset, MAVFER.
Article
Different from many other attributes, facial expression can change in a continuous way; therefore, a slight semantic change of the input should lead to an output fluctuation limited to a small scale. This consistency is important. However, current Facial Expression Recognition (FER) datasets suffer from extreme class imbalance, limited data, and excessive noise, which hinder this consistency and degrade test-time performance. In this paper, we consider not only the prediction accuracy on sample points but also their neighborhood smoothness, focusing on the stability of the output with respect to slight semantic perturbations of the input. A novel method is proposed to formulate semantic perturbations and select unreliable samples during training, reducing their negative effect. Experiments show the effectiveness of the proposed method and report state-of-the-art results, closing the gap to the estimated upper limit by 30% relative to prior state-of-the-art methods on AffectNet, the largest in-the-wild FER database to date.
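The underlying consistency idea, penalizing output changes under a small perturbation of the input, can be illustrated with a simple regularizer. The paper formulates semantic perturbations and unreliable-sample selection more carefully; this sketch only shows the smoothness term, with a random perturbation as an assumed stand-in.

```python
import torch
import torch.nn.functional as F

def smoothness_penalty(model, images, epsilon=0.01):
    """Sketch: KL divergence between predictions on clean and slightly perturbed inputs."""
    with torch.no_grad():
        p_clean = F.softmax(model(images), dim=-1)          # reference distribution
    perturbed = images + epsilon * torch.randn_like(images)  # small random perturbation
    p_pert = F.log_softmax(model(perturbed), dim=-1)
    return F.kl_div(p_pert, p_clean, reduction="batchmean")  # added to the main loss
```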
Article
Despite the growing importance of Speech Emotion Recognition (SER), state-of-the-art accuracy is quite low and needs improvement to make commercial applications of SER viable. A key underlying reason for the low accuracy is the scarcity of emotion datasets, which is a challenge for developing any robust machine learning model in general. In this paper, we propose a solution to this problem: a multi-task learning framework that uses auxiliary tasks for which data is abundantly available. We show that utilizing this additional data can improve the primary task of SER, for which only limited labelled data is available. In particular, we use gender identification and speaker recognition as auxiliary tasks, which allows the use of very large datasets, e.g., speaker classification datasets. To maximize the benefit of multi-task learning, we further use an adversarial autoencoder (AAE) within our framework, which has a strong capability to learn powerful and discriminative features. Furthermore, the unsupervised AAE in combination with the supervised classification networks enables semi-supervised learning, which incorporates a discriminative component into the AAE's unsupervised training pipeline. The proposed model is rigorously evaluated for categorical and dimensional emotion and in cross-corpus scenarios. Experimental results demonstrate that the proposed model achieves state-of-the-art performance on two publicly available datasets.
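The multi-task part of such a framework amounts to a shared encoder with separate heads for emotion, gender, and speaker identity. The sketch below shows that structure only; the adversarial autoencoder and semi-supervised training are omitted, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskSER(nn.Module):
    """Sketch: shared speech encoder with emotion, gender, and speaker heads."""
    def __init__(self, feat_dim=40, hidden=128, n_emotions=4, n_speakers=100):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)   # shared across tasks
        self.emotion_head = nn.Linear(hidden, n_emotions)           # primary task
        self.gender_head = nn.Linear(hidden, 2)                     # auxiliary task
        self.speaker_head = nn.Linear(hidden, n_speakers)           # auxiliary task

    def forward(self, frames):                          # frames: (batch, time, feat_dim)
        _, h = self.encoder(frames)                     # h: (1, batch, hidden)
        z = h.squeeze(0)
        return self.emotion_head(z), self.gender_head(z), self.speaker_head(z)

emo, gen, spk = MultiTaskSER()(torch.randn(2, 150, 40))
```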
Article
Speech emotion recognition (or classification) is one of the most challenging topics in data science. In this work, we introduce a new architecture, which extracts mel-frequency cepstral coefficients, chromagram, mel-scale spectrogram, Tonnetz representation, and spectral contrast features from sound files and uses them as inputs to a one-dimensional Convolutional Neural Network for the identification of emotions, using samples from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), Berlin (EMO-DB), and Interactive Emotional Dyadic Motion Capture (IEMOCAP) datasets. We use an incremental method to modify our initial model in order to improve classification accuracy. All of the proposed models work directly with raw sound data without the need for conversion to visual representations, unlike some previous approaches. Based on experimental results, our best-performing model outperforms existing frameworks for RAVDESS and IEMOCAP, thus setting a new state of the art. For the EMO-DB dataset, it outperforms all previous works except one but compares favorably with that one in terms of generality, simplicity, and applicability. Specifically, the proposed framework obtains 71.61% for RAVDESS with 8 classes, 86.1% for EMO-DB with 535 samples in 7 classes, 95.71% for EMO-DB with 520 samples in 7 classes, and 64.3% for IEMOCAP with 4 classes in speaker-independent audio classification tasks.
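The five feature families named above are all available in librosa; a simplified extraction that time-averages each family into one fixed-length vector is sketched below. The paper feeds frame-level features to a 1-D CNN, so this averaging is an assumption for brevity.

```python
import numpy as np
import librosa

def extract_features(path, sr=22050):
    """Time-averaged MFCC, chroma, mel-spectrogram, spectral contrast, and Tonnetz features."""
    y, sr = librosa.load(path, sr=sr)
    stft = np.abs(librosa.stft(y))
    return np.hstack([
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).mean(axis=1),
        librosa.feature.chroma_stft(S=stft, sr=sr).mean(axis=1),
        librosa.feature.melspectrogram(y=y, sr=sr).mean(axis=1),
        librosa.feature.spectral_contrast(S=stft, sr=sr).mean(axis=1),
        librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr).mean(axis=1),
    ])                                     # 40 + 12 + 128 + 7 + 6 = 193-dimensional vector
```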
Article
Driven by recent advances in human-centered computing, Facial Expression Recognition (FER) has attracted significant attention in many applications. However, most conventional approaches either perform face frontalization on a non-frontal facial image or learn a separate classifier for each pose. Different from existing methods, this paper proposes an end-to-end deep learning model that enables simultaneous facial image synthesis and pose-invariant facial expression recognition by exploiting the shape geometry of the face image. The proposed model is based on a generative adversarial network (GAN) and has several merits. First, given an input face and a target pose and expression designated by a set of facial landmarks, an identity-preserving face can be generated, guided by the target pose and expression. Second, the identity representation is explicitly disentangled from both expression and pose variations through the shape geometry delivered by the facial landmarks. Third, our model can automatically generate face images with different expressions and poses in a continuous way to enlarge and enrich the training set for the FER task. Our approach is demonstrated to perform well when compared with state-of-the-art algorithms on both controlled and in-the-wild benchmark datasets including Multi-PIE, BU-3DFE, and SFEW.
Article
In recent years, rapid advances in machine learning (ML) and information fusion have made it possible to endow machines/computers with the ability of emotion understanding, recognition, and analysis. Emotion recognition has attracted increasingly intense interest from researchers in diverse fields. Human emotions can be recognized from facial expressions, speech, behavior (gesture/posture) or physiological signals. However, the first three methods can be ineffective since humans may involuntarily or deliberately conceal their real emotions (so-called social masking). The use of physiological signals can lead to more objective and reliable emotion recognition. Compared with peripheral neurophysiological signals, electroencephalogram (EEG) signals respond to fluctuations of affective states more sensitively and in real time and thus can provide useful features of emotional states. Therefore, various EEG-based emotion recognition techniques have been developed recently. In this paper, emotion recognition methods based on multi-channel EEG signals as well as multi-modal physiological signals are reviewed. Following the standard pipeline for emotion recognition, we review different feature extraction (e.g., wavelet transform and nonlinear dynamics), feature reduction, and ML classifier design methods (e.g., k-nearest neighbor (KNN), naive Bayes (NB), support vector machine (SVM) and random forest (RF)). Furthermore, the EEG rhythms that are highly correlated with emotions are analyzed and the correlation between different brain areas and emotions is discussed. Finally, we compare different ML and deep learning algorithms for emotion recognition and suggest several open problems and future research directions in this exciting and fast-growing area of AI.
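The classical pipeline reviewed here (hand-crafted EEG features followed by a conventional classifier) can be illustrated with a toy scikit-learn example. Band-power features and random data are used purely for illustration; they are assumptions, not the reviewed methods themselves.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

def band_power_features(eeg, fs=128):
    """Toy feature extractor: log power in canonical EEG bands per channel.
    eeg: (n_trials, n_channels, n_samples)."""
    bands = [(4, 8), (8, 13), (13, 30), (30, 45)]      # theta, alpha, beta, gamma
    freqs = np.fft.rfftfreq(eeg.shape[-1], d=1.0 / fs)
    psd = np.abs(np.fft.rfft(eeg, axis=-1)) ** 2
    feats = [np.log(psd[..., (freqs >= lo) & (freqs < hi)].mean(axis=-1) + 1e-12)
             for lo, hi in bands]
    return np.concatenate(feats, axis=-1)               # (n_trials, n_channels * 4)

X = band_power_features(np.random.randn(60, 32, 512))   # random stand-in for real EEG trials
y = np.random.randint(0, 2, 60)                         # e.g., low/high valence labels
for clf in (KNeighborsClassifier(), SVC(), RandomForestClassifier()):
    score = cross_val_score(make_pipeline(StandardScaler(), clf), X, y).mean()
    print(clf.__class__.__name__, round(score, 3))
```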
Article
An ensemble visual-audio emotion recognition framework based on multi-task and blending learning with multiple features is proposed in this paper. To address the problem that existing features cannot accurately identify different emotions, we extract two kinds of features, i.e., Interspeech 2010 and deep features for audio data, and LBP and deep features for visual data, with the intent of accurately identifying different emotions by using different features. Owing to the diversity of these features, SVM classifiers are designed for the manual features (Interspeech 2010 features and local LBP features) and CNNs for the deep features, through which four sub-models are obtained. Finally, a blending ensemble algorithm fuses the sub-models to improve the recognition performance of visual-audio emotion recognition. In addition, multi-task learning is applied in the CNN model for deep features, which can predict multiple tasks at the same time with fewer parameters and improves the sensitivity of the single recognition model to the user's emotion by sharing information between tasks. Experiments are performed on the eNTERFACE database, and the results indicate that the recognition accuracy of the multi-task CNN increased by 3% and 2% on average over the single-task CNN model in speaker-independent and speaker-dependent experiments, respectively. The visual-audio emotion recognition accuracy of our method reaches 81.36% and 78.42% in speaker-independent and speaker-dependent experiments, respectively, which is higher than some state-of-the-art works.
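Blending, as opposed to simple averaging, fits a meta-learner on the sub-models' predictions for a held-out set. A minimal sketch of that step is given below; the logistic-regression meta-learner and the toy data are assumptions used only to show the mechanics.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def blend(submodel_probs_holdout, y_holdout, submodel_probs_test):
    """Sketch of blending: fit a meta-learner on holdout-set probabilities of the
    sub-models (e.g., SVM-on-Interspeech2010, SVM-on-LBP, two CNNs), then combine
    their test-set probabilities."""
    X_hold = np.hstack(submodel_probs_holdout)      # (n_holdout, n_models * n_classes)
    X_test = np.hstack(submodel_probs_test)
    meta = LogisticRegression(max_iter=1000).fit(X_hold, y_holdout)
    return meta.predict(X_test)

# Toy usage with two fake sub-models and 6 emotion classes:
rng = np.random.default_rng(0)
hold = [rng.random((50, 6)), rng.random((50, 6))]
test = [rng.random((20, 6)), rng.random((20, 6))]
pred = blend(hold, rng.integers(0, 6, 50), test)
```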
Article
The advancement of Human-Robot Interaction (HRI) drives research into advanced emotion identification architectures that interpret the audio-visual (A-V) modalities of human emotion. State-of-the-art methods in multi-modal emotion recognition mainly focus on classifying complete video sequences, leading to systems with no online capability. Such techniques can predict emotions only after the videos have concluded, which restricts their applicability in practical scenarios. This paper provides a novel paradigm for online emotion classification, which exploits both audio and visual modalities and produces a responsive prediction as soon as the system is confident enough. We propose two deep Convolutional Neural Network (CNN) models for extracting emotion features, one per modality, and a Deep Neural Network (DNN) for their fusion. To capture the temporal nature of human emotion in interactive scenarios, we train in cascade a Long Short-Term Memory (LSTM) layer and a Reinforcement Learning (RL) agent that monitors the speaker and decides when to stop feature extraction and make the final prediction. The comparison of our results on two publicly available A-V emotional datasets, RML and BAUM-1s, against other state-of-the-art models demonstrates the beneficial capabilities of our work.
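The online-prediction idea can be illustrated with a simplified sketch in which an LSTM consumes fused A-V features step by step and emits a label once its softmax confidence crosses a fixed threshold. The paper replaces this fixed threshold with a trained RL agent; the threshold, feature dimensions, and class count below are assumptions.

```python
import torch
import torch.nn as nn

class OnlineEmotionPredictor(nn.Module):
    """Sketch of confidence-gated online prediction over a stream of fused A-V features."""
    def __init__(self, feat_dim=128, hidden=64, n_classes=6, threshold=0.9):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden)
        self.head = nn.Linear(hidden, n_classes)
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, feature_stream):                # iterable of (feat_dim,) tensors
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros(1, self.cell.hidden_size)
        for t, feat in enumerate(feature_stream):
            h, c = self.cell(feat.unsqueeze(0), (h, c))
            probs = torch.softmax(self.head(h), dim=-1)
            conf, label = probs.max(dim=-1)
            if conf.item() >= self.threshold:         # confident enough: stop early
                return label.item(), t
        return label.item(), t                         # otherwise fall back to the last step

label, step = OnlineEmotionPredictor()(torch.randn(30, 128))
```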