Figure 1. Overview of the proposed emotion recognition system. The face video is provided by the challenge organizers. For audio emotion recognition, we assume that non-vocal signals in movies can also contribute to emotion recognition; thus we do not separate the audio into vocal and non-vocal segments during preprocessing.


Source publication
Conference Paper
Full-text available
This paper presents the techniques used in our contribution to the Emotion Recognition in the Wild 2016 video-based sub-challenge. The purpose of the sub-challenge is to classify the six basic emotions (angry, sad, happy, surprise, fear & disgust) and neutral. Compared to earlier years' movie-based datasets, this year's test dataset introduced realit...
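As a rough illustration of how the two subsystems in Figure 1 could be combined, the sketch below fuses per-class probabilities from a video model and an audio model at the score level. The seven class labels match the challenge's emotion set, but the fusion weight and the probability vectors are purely illustrative assumptions, not the paper's actual method.

```python
# Hypothetical late-fusion sketch: combine per-class probabilities from the
# video and audio subsystems with an illustrative weight (not the paper's value).
import numpy as np

EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

def fuse_scores(p_video: np.ndarray, p_audio: np.ndarray, w_video: float = 0.6) -> str:
    """Weighted average of subsystem probabilities, then argmax over classes."""
    p_fused = w_video * p_video + (1.0 - w_video) * p_audio
    return EMOTIONS[int(np.argmax(p_fused))]

# Example with dummy probability vectors
p_v = np.array([0.05, 0.05, 0.05, 0.60, 0.10, 0.10, 0.05])
p_a = np.array([0.10, 0.05, 0.05, 0.40, 0.20, 0.10, 0.10])
print(fuse_scores(p_v, p_a))  # -> "happy"
```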

Similar publications

Conference Paper
Full-text available
This paper presents the techniques used in our contribution to the Emotion Recognition in the Wild 2017 video-based sub-challenge. The purpose of the sub-challenge is to classify the six basic emotions (angry, sad, happy, surprise, fear and disgust) and neutral. Our proposed solution utilizes three state-of-the-art techniques to overcome the challenge...

Citations

... This multimodal approach has achieved notable performance in capturing emotions. Likewise, in the context of in-the-wild facial emotion recognition from videos, Ding et al. [23] have addressed the challenge by incorporating audio emotion recognition alongside facial analysis using deep neural networks. The integration of both subsystems has proved effective, resulting in state-of-the-art performance. ...
... Alternative machine learning techniques have also been scrutinized in the quest for accurate emotion recognition in-the-wild. Logistic regression and partial least squares are notable examples of such techniques that have been explored for FER [23]. These methodologies, while distinct from deep neural networks, hold their own advantages and appeal for certain scenarios. ...
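For readers unfamiliar with these alternatives, the snippet below is a minimal, hypothetical sketch of a partial-least-squares projection followed by logistic regression on pre-extracted facial features; the random data and dimensions are placeholders, and the pipeline is not the one used in [23].

```python
# Hypothetical PLS + logistic-regression pipeline on stand-in facial features.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 512))        # placeholder facial descriptors
y = rng.integers(0, 7, size=200)       # 7 emotion classes (angry ... neutral)

# Supervised projection to a low-dimensional space, then a linear classifier.
pls = PLSRegression(n_components=10).fit(X, y)
clf = LogisticRegression(max_iter=1000).fit(pls.transform(X), y)

print(clf.predict(pls.transform(X[:3])))  # predicted class indices
```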
Article
Full-text available
Facial expressions are a crucial aspect of human communication that provide information about emotions, intentions, interactions, and social relationships. They are a universal signal used daily to convey inner behaviors in natural situations. With the increasing interest in automatic facial emotion recognition, deep neural networks have become a popular tool for recognizing emotions in challenging in-the-wild conditions that are closer to reality. However, these systems must contend with external factors that degrade the quality of facial features, making it challenging to determine the correct emotion classes. In this paper, we first provide a summary of the various fields that use facial recognition systems under in-the-wild context. Then, we extensively examine the major datasets utilized for in-the-wild facial expression recognition, taking into account their appropriateness for this context, the challenges related to their application, the coverage of various emotions, and the potential domains of application. The analysis is conducted rigorously, emphasizing the merits and demerits of each dataset and advocating for their pertinence and effectiveness in real-life situations. We also present an expanded taxonomy of facial emotion recognition in-the-wild, while focusing mainly on deep learning methods and covering the manufacturing steps of a facial emotion recognition system and the different possible techniques for each step. Finally, we provide a discussion, insights, and conclusion, making this survey a reference point for researchers interested in the in-the-wild context, while providing a better understanding of the different datasets’ compositions and specificities. This survey can help advance research on deep facial emotion recognition in-the-wild and serve as a resource for methods, applications, and datasets in the field.
... In our future work, we will extend the proposed framework to account for the speaker-independent scenario. We will also validate the framework on emotion recognition in-the-wild datasets such as AFEW [26] and ERIW [27]. ...
Conference Paper
Full-text available
Emotion recognition, an important research problem in human-robot interactions, is primarily achieved by extracting human emotions from audio and visual data. State-of-the-art performance is reported by audiovisual sensor fusion algorithms using deep learning models such as CNN, RNN, and LSTM. However, the RNN and LSTM are shown to be limited in handling the long-term dependencies over the entire input sequence. In this work, we propose to improve the performance of audiovisual emotion recognition using a novel transformer-based model, containing three transformer branches, named multimodal transformers. The three transformer branches, in our work, compute the audio self-attention, the video self-attention, and the audio-video cross-attention. The self-attention branches identify the most relevant information in the audio and video input, while the cross-attention branch identifies the most relevant audio-video interactive information. Combining the relevant information from these three branches gives the best performance in our ablation study. We also propose a novel temporal embedding scheme, termed block embedding, to add temporal information to the visual features derived from the multiple frames in the video. The proposed architecture is validated using the RAVDESS, CREMA-D, and SAVEE audiovisual public datasets. A detailed ablation study and comparative analysis with baseline models are performed. The results show that the proposed multimodal transformer framework is better than the baseline methods.
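A compact sketch of the three-branch attention idea described in this abstract is given below, using standard PyTorch multi-head attention for the audio self-attention, video self-attention, and audio-to-video cross-attention branches; the embedding size, pooling, and classification head are illustrative assumptions rather than the paper's exact architecture.

```python
# Rough three-branch attention sketch: audio self-attention, video self-attention,
# and audio-video cross-attention, pooled and concatenated before a classifier.
import torch
import torch.nn as nn

class ThreeBranchAttention(nn.Module):
    def __init__(self, dim=256, heads=4, num_classes=7):
        super().__init__()
        self.audio_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(3 * dim, num_classes)

    def forward(self, audio, video):
        a, _ = self.audio_self(audio, audio, audio)   # audio self-attention
        v, _ = self.video_self(video, video, video)   # video self-attention
        c, _ = self.cross(audio, video, video)        # audio queries the video
        pooled = torch.cat([a.mean(1), v.mean(1), c.mean(1)], dim=-1)
        return self.head(pooled)

# Dummy batch: 50 audio frames and 16 video frames of 256-d features.
logits = ThreeBranchAttention()(torch.randn(2, 50, 256), torch.randn(2, 16, 256))
```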
... In the literature, video sequences are mostly processed using aggregation and concatenation of features for each facial image into clips. Yet, they cannot leverage the temporal dependencies over a video [26,27]. To circumvent this issue, convolutional recurrent neural networks (CRNN) and 3D-CNN architectures have been proposed to encode spatiotemporal relations among video frames [28,29]. ...
... A conventional approach for dealing with video frames is to aggregate features extracted from each frame into a clip-level representation before final emotion prediction (Ding et al. [27]). In addition to feature aggregation, Bargal et al. [26] also aggregated the mean, variance, minimum and maximum over a sequence of features, thus adding some statistical information. ...
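The clip-level statistical aggregation described in the two snippets above can be illustrated in a few lines of NumPy; the feature dimension and frame count below are arbitrary.

```python
# Toy illustration: collapse per-frame features into one clip-level vector
# using simple statistics (mean, variance, min, max).
import numpy as np

def aggregate_clip(frame_feats: np.ndarray) -> np.ndarray:
    """frame_feats: (num_frames, feat_dim) -> (4 * feat_dim,) clip descriptor."""
    return np.concatenate([
        frame_feats.mean(axis=0),
        frame_feats.var(axis=0),
        frame_feats.min(axis=0),
        frame_feats.max(axis=0),
    ])

clip = aggregate_clip(np.random.rand(120, 1024))  # e.g. 120 frames of CNN features
print(clip.shape)  # (4096,)
```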
Article
Full-text available
Facial expressions are one of the most powerful ways to depict specific patterns in human behavior and describe the human emotional state. However, despite the impressive advances of affective computing over the last decade, automatic video-based systems for facial expression recognition still cannot correctly handle variations in facial expression among individuals as well as cross-cultural and demographic aspects. Nevertheless, recognizing facial expressions is a difficult task, even for humans. This paper investigates the suitability of state-of-the-art deep learning architectures based on convolutional neural networks (CNNs) to deal with long video sequences captured in the wild for continuous emotion recognition. For such an aim, several 2D CNN models that were designed to model spatial information are extended to allow spatiotemporal representation learning from videos, considering a complex and multi-dimensional emotion space, where continuous values of valence and arousal must be predicted. We have developed and evaluated convolutional recurrent neural networks, combining 2D CNNs and long short-term memory units, as well as inflated 3D CNN models, which are built by inflating the weights of a pre-trained 2D CNN model during fine-tuning, using application-specific videos. Experimental results on the challenging SEWA-DB dataset have shown that these architectures can effectively be fine-tuned to encode spatiotemporal information from successive raw pixel images and achieve state-of-the-art results on such a dataset.
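The weight-inflation step mentioned in this abstract (building a 3D CNN from a pre-trained 2D CNN) can be sketched roughly as below, following the general I3D-style recipe of repeating each 2D kernel along a new temporal axis and rescaling; this is an assumption-laden simplification, not the exact procedure used by the authors.

```python
# Simplified "weight inflation": replicate a pre-trained 2D kernel along the
# temporal axis and divide by T so activations stay on a comparable scale.
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(time_dim, *conv2d.kernel_size),
                       padding=(time_dim // 2, *conv2d.padding),
                       bias=conv2d.bias is not None)
    w2d = conv2d.weight.data                       # (out, in, kH, kW)
    conv3d.weight.data = w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
    if conv2d.bias is not None:
        conv3d.bias.data = conv2d.bias.data.clone()
    return conv3d

conv3d = inflate_conv2d(nn.Conv2d(3, 64, kernel_size=3, padding=1))
out = conv3d(torch.randn(1, 3, 8, 112, 112))       # (batch, C, T, H, W)
```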
... Generally, training a deep network with only hundreds of training samples will result in overfitting. Therefore, we need methods to train with small datasets like [48] or use data augmentation methods to create a large training dataset. Traditional augmentation methods include horizontal flipping, color space augmentations, and random cropping. ...
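For concreteness, the traditional augmentations listed in this context can be expressed with torchvision transforms; the specific parameter values here are arbitrary defaults, not ones taken from the cited work.

```python
# Classical image augmentations: horizontal flip, color jitter, random crop.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomCrop(224, padding=8),
    transforms.ToTensor(),
])
```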
Article
Full-text available
Convolutional Neural Networks (CNNs) have achieved great success in learning computer vision tasks, particularly 3D CNNs, for extracting spatio-temporal features from the given videos. However, 3D CNNs have not been well-examined for the Stereoscopic Video Quality Assessment (SVQA). To our best knowledge, most of the state-of-the-art methods used the traditional hand-crafted feature extraction methods for the SVQA. Very few methods used the power of deep learning for SVQA, and they just considered the spatial information, ignoring the disparity and motion information. In this paper, we propose a No-Reference (NR) deep 3D CNN architecture that jointly focuses on spatial, disparity, and temporal information between consecutive frames. A 3-Stream 3D CNN, shortly 3S-3DCNN, by performing 3D CNNs, extracts features from spatial, motion, and depth channels to estimate the stereo video’s quality. It captures the degradations in the quality of the stereoscopic video in multiple dimensions. Firstly, the scene flow, which is the joint prediction of the optical flow and stereo disparity, is calculated. Then, the spatial information, optical flow, and disparity map of a given video are used as input to the 3S-3DCNN model. The extracted features are concatenated and utilized as inputs to the fully connected layers for doing the regression. We split the input videos into cube patches for data augmentation and remove the cubes that confuse our model from the training and testing sets. Two standard stereoscopic video quality assessment benchmarks of LFOVIAS3DPh2 and NAMA3DS1-COSPAD1 were used to evaluate our method. Experimental results show that our 3S-3DCNN method’s objective score significantly correlates with the subjective SVQ scores in multiple video datasets. The RMSE for NAMA3DS1-COSPAD1 dataset is 0.2757, which outperforms other methods by a large margin. The SROCC value for the blur distortion of the LFOVIAS3DPh2 dataset is more than 98%, indicating that the 3S-3DCNN is consistent with human visual perception.
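A toy stand-in for the three-stream design described above (spatial, motion, and disparity streams of 3D convolutions whose features are concatenated and regressed to a quality score) might look as follows; channel counts, depths, and input sizes are invented for illustration and do not reflect the actual 3S-3DCNN configuration.

```python
# Minimal three-stream 3D-CNN sketch: RGB, optical-flow, and disparity inputs
# pass through separate 3D conv streams, then concatenated features feed a regressor.
import torch
import torch.nn as nn

def stream(in_ch):
    return nn.Sequential(nn.Conv3d(in_ch, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool3d(1), nn.Flatten())

class ThreeStream3DCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.spatial, self.flow, self.disparity = stream(3), stream(2), stream(1)
        self.regressor = nn.Linear(3 * 16, 1)      # predicted quality score

    def forward(self, rgb, flow, disp):
        feats = torch.cat([self.spatial(rgb), self.flow(flow), self.disparity(disp)], dim=1)
        return self.regressor(feats)

score = ThreeStream3DCNN()(torch.randn(2, 3, 8, 64, 64),
                           torch.randn(2, 2, 8, 64, 64),
                           torch.randn(2, 1, 8, 64, 64))
```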
... This suggests that deep and handcrafted features are complementary. As of this writing, the findings of most studies doing related comparisons support this assumption [96], [97], [98], [99], though in many cases a comparison is difficult as reported accuracies for well-performing fusion approaches include features from other modalities (see Section 3.3). Only one study reported that deep and handcrafted features are not complementary [100]. ...
... One study suggested that a smaller window size (2s) could be the best choice [65]. In general, the addition of RNN for temporal modeling is associated with an increase in model accuracy (e.g., [97], [129], [162]). When dealing with dimensional labels, this allows learning features at the frame level. ...
Article
Automatic human affect recognition is a key step towards more natural human-computer interaction. Recent trends include recognition in the wild using a fusion of audiovisual and physiological sensors, a challenging setting for conventional machine learning algorithms. Since 2010, novel deep learning algorithms have been applied increasingly in this field. In this paper, we review the literature on human affect recognition between 2010 and 2017, with a special focus on approaches using deep neural networks. By classifying a total of 950 studies according to their usage of shallow or deep architectures, we are able to show a trend towards deep learning. Reviewing a subset of 233 studies that employ deep neural networks, we comprehensively quantify their applications in this field. We find that deep learning is used for learning of (i) spatial feature representations, (ii) temporal feature representations, and (iii) joint feature representations for multimodal sensor data. Exemplary state-of-the-art architectures illustrate the progress. Our findings show the role deep architectures will play in human affect recognition, and can serve as a reference point for researchers working on related applications.
... The authors constructed the concise feature set named GeMAPS, which involves 62 audio features. Recent studies show that fusing audio and visual information results in more accurate models compared to models that rely on a single modality [40]. In [41] energy and spectral audio features are fused with visual information for emotion prediction in short video clips, while in [42] the authors exploit audiovisual data to train a deep neural network for recognizing affect in real-world environments. ...
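As a practical aside, GeMAPS functionals such as those referenced here can be extracted with audEERING's opensmile Python package; the sketch below assumes that package is installed, uses a placeholder file path, and the exact feature-set version chosen is an assumption.

```python
# Extracting GeMAPS functionals with the opensmile Python wrapper.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.GeMAPSv01b,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("speech.wav")  # one row of GeMAPS functionals
print(features.shape)
```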
Preprint
What if emotion could be captured in a general and subject-agnostic fashion? Is it possible, for instance, to design general-purpose representations that detect affect solely from the pixels and audio of a human-computer interaction video? In this paper we address the above questions by evaluating the capacity of deep learned representations to predict affect by relying only on audiovisual information of videos. We assume that the pixels and audio of an interactive session embed the necessary information required to detect affect. We test our hypothesis in the domain of digital games and evaluate the degree to which deep classifiers and deep preference learning algorithms can learn to predict the arousal of players based only on the video footage of their gameplay. Our results from four dissimilar games suggest that general-purpose representations can be built across games as the arousal models obtain average accuracies as high as 85% using the challenging leave-one-video-out cross-validation scheme. The dissimilar audiovisual characteristics of the tested games showcase the strengths and limitations of the proposed method.
... An extension of face emotion analysis is proposed in [69], using an audio spectrogram and a human face image in an integrated multimodal architecture. The multimodal approaches [11,13,41,44] combine audio and video using a recurrent network with LSTM cells for face video emotion recognition. The multimodal model in [61], built on a one-dimensional (1D) audio network and a 2D video network for speech recognition, uses hybrid information fusion by adding a recurrent neural network after the concatenation of learned features. ...
... There are three information fusion methods: early, late and hybrid fusion. As in [11,41,69], multimodal fusion provides the benefits of robustness, complementary information gain and functional continuity of the system even when one or more modalities fail. Early (or feature-level) fusion integrates low-level features from each modality by correlation, which can potentially accomplish the task better but has difficulty with temporal synchronization among the various input sources. ...
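The contrast between early (feature-level) and late (decision-level) fusion described in this passage can be reduced to a toy example; the vectors, dimensions, and weighting below are illustrative only.

```python
# Toy contrast of fusion strategies: early fusion concatenates features before a
# single classifier, late fusion averages per-modality class probabilities.
import numpy as np

def early_fusion(audio_feat, video_feat):
    return np.concatenate([audio_feat, video_feat])   # one joint feature vector

def late_fusion(p_audio, p_video, w=0.5):
    return w * p_audio + (1 - w) * p_video            # average class probabilities

joint = early_fusion(np.random.rand(62), np.random.rand(512))
probs = late_fusion(np.array([0.2, 0.8]), np.array([0.4, 0.6]))
```

Hybrid fusion mixes both ideas, for example concatenating features while also combining per-modality decisions.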
Article
Full-text available
Affective computing is an emerging area of research that aims to enable intelligent systems to recognize, feel, infer and interpret human emotions. Widely available online and offline music videos are a rich source for human emotion analysis because they integrate the composer's internal feelings through song lyrics, musical instrument performance and visual expression. In general, the metadata that music video customers use to choose a product includes high-level semantics such as emotion, so automatic emotion analysis may be necessary. In this research area, however, the lack of a labeled dataset is a major problem. Therefore, we first construct a balanced music video emotion dataset including diversity of territory, language, culture and musical instruments. We test this dataset over four unimodal and four multimodal convolutional neural networks (CNNs) for music and video. First, we separately fine-tuned each pre-trained unimodal CNN and tested the performance on unseen data. In addition, we train a 1-dimensional CNN-based music emotion classifier with raw waveform input. A comparative analysis of each unimodal classifier over various optimizers is made to find the best model that can be integrated into a multimodal structure. The best unimodal model is integrated with the corresponding music and video network features for the multimodal classifier. The multimodal structure integrates whole music video features and makes the final classification with a SoftMax classifier by a late feature fusion strategy. All possible multimodal structures are also combined into one predictive model to get the overall prediction. All the proposed multimodal structures use cross-validation to overcome the data scarcity problem (overfitting) at the decision level. The evaluation results using various metrics show a boost in the performance of the multimodal architectures compared to each unimodal emotion classifier. The predictive model obtained by integrating all multimodal structures achieves 88.56% accuracy, an f1-score of 0.88, and an area under the curve (AUC) score of 0.987. The results suggest that high-level human emotions are classified well by the proposed CNN-based multimodal networks, even though only a small amount of labeled data is available for training.
... A classical approach for dealing with image sequences is to aggregate per-frame deep features over a whole clip before final emotion prediction, such as in Ding et al. (2016). In addition to the feature aggregation, Bargal et al. (2016) concatenated the mean, variance, minimum and maximum over the sequence of feature vectors (Warr et al., 2014), thus adding statistical information. ...
Preprint
Full-text available
Attention to affective computing and emotion recognition has increased in the last decade. Facial expressions are one of the most powerful ways of depicting specific patterns in human behavior and describing the human emotional state. Nevertheless, even for humans, identifying facial expressions is difficult, and automatic video-based systems for facial expression recognition (FER) have often suffered from variations in expressions among individuals and from a lack of diverse and cross-cultural training datasets. However, with video sequences captured in-the-wild and more complex emotion representations such as dimensional models, deep FER systems have the ability to learn more discriminative feature representations. In this paper, we present a survey of the state-of-the-art approaches based on convolutional neural networks (CNNs) for long video sequences recorded in in-the-wild settings, considering the continuous emotion space of valence and arousal. Since few studies have used 3D-CNNs for FER systems and dimensional representations of emotions, we propose an inflated 3D-CNN architecture, allowing for weight inflation of a pre-trained 2D-CNN model in order to perform the essential transfer learning for our video-based application. As a baseline, we also considered a 2D-CNN architecture cascaded with a long short-term memory network, so that we could finally conclude with a model comparison of the two approaches for the spatiotemporal representation of facial features and the regression of valence/arousal values for emotion prediction. The experimental results on the RAF-DB and SEWA-DB datasets have shown that these fine-tuned architectures effectively encode the spatiotemporal information from raw pixel images and achieve far better results than the current state-of-the-art.
... Authors are also developing temporal models for the audio modality. One of them is Wan Ding [10], who mixes Low-Level Descriptors (LLDs) with Recurrent Neural Networks (RNNs). ...
... Regarding video images, emotion recognition is traditionally related to facial expression [15]. We can divide publications according to the criteria of static [10] and dynamic inputs, which are of interest for the present work. CNNs have exhibited promising performance in feature learning from images, but recently we note an increasing use of RNNs to quantify visual motion. ...
... The same conclusion is reached in [11], which addresses it by implementing Deep Belief Networks (DBNs) as the fusion method. Publications [17], [10], [18], [11] use model architectures based on two-stream convolutional networks, with a final classifier such as an SVM or DBN after the fully connected layers, obtaining better performance than single (isolated) data modalities. ...
Chapter
Full-text available
Automatic emotion recognition is renowned for being a difficult task, even for human intelligence. Given the importance of having enough data in classification problems, we introduce a framework developed to generate labeled audio and create our own database. In this paper we present a new model for audio-video emotion recognition using Transfer Learning (TL). The idea is to combine a pre-trained high-level feature extractor Convolutional Neural Network (CNN) with a Bidirectional Recurrent Neural Network (BRNN) model to address the issue of variable-length input sequences. Throughout the design process we discuss the main problems related to the high complexity of the task due to its inherently subjective nature and, on the other hand, the important results obtained by testing the model on different databases, outperforming the state-of-the-art algorithms on the SAVEE [3] database. Furthermore, we use this application to perform precise per-user classification in low-resource real-world scenarios with promising results.
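The variable-sequence-length issue mentioned in this abstract is commonly handled in PyTorch by padding and packing per-frame CNN features before a bidirectional recurrent layer; the sketch below shows that standard recipe with a BiGRU and invented dimensions, not the authors' actual CNN+BRNN model.

```python
# Standard padding/packing recipe for variable-length sequences of CNN features
# fed to a bidirectional GRU, followed by a small classification head.
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence

feat_dim, hidden, num_classes = 512, 128, 7
rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
head = nn.Linear(2 * hidden, num_classes)

# Two clips of different lengths (e.g. 40 and 25 frames of CNN features).
clips = [torch.randn(40, feat_dim), torch.randn(25, feat_dim)]
lengths = torch.tensor([len(c) for c in clips])
padded = pad_sequence(clips, batch_first=True)
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

_, h_n = rnn(packed)                   # h_n: (2, batch, hidden) for a 1-layer BiGRU
logits = head(torch.cat([h_n[0], h_n[1]], dim=-1))
```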
... In addition to the visual content, the tone and acoustic patterns of the speech (e.g., speech speed, loudness, pitch of voice etc.) provide useful cues toward the expression and recognition of emotions. We extracted the audio features using the AcousEmo tool, which was developed by a research programme centered on acoustic emotion recognition [57][58][59][60][61][62]. All AcousEmo audio features are derived based on the acoustic signals of the audio data, without using any cues from the actual speech content or semantics. ...
Preprint
Full-text available
Emotional expressions form a key part of user behavior on today's digital platforms. While multimodal emotion recognition techniques are gaining research attention, there is a lack of deeper understanding on how visual and non-visual features can be used in better recognizing emotions for certain contexts, but not others. This study analyzes the interplay between the effects of multimodal emotion features derived from facial expressions, tone and text in conjunction with two key contextual factors: 1) the gender of the speaker, and 2) the duration of the emotional episode. Using a large dataset of more than 2500 manually annotated videos from YouTube, we found that while multimodal features consistently outperformed bimodal and unimodal features, their performances varied significantly for different emotions, gender and duration contexts. Multimodal features were found to perform particularly better for male than female speakers in recognizing most emotions except for fear. Furthermore, multimodal features performed particularly better for shorter than for longer videos in recognizing neutral, happiness, and surprise, but not sadness, anger, disgust and fear. These findings offer new insights towards the development of more context-aware emotion recognition and empathetic systems.