Fig 2. Frequency of emotion category labels: the grey areas indicate the number of speech segments that could be associated with an emotion 'click'.

Source publication
Conference Paper
Full-text available
We developed acoustic and lexical classifiers, based on a boosting algorithm, to assess the separability on arousal and valence dimensions in spontaneous emotional speech. The spontaneous emotional speech data was acquired by inviting subjects to play a first-person shooter video game. Our acoustic classifiers performed significantly better than th...

Citations

... Some emotions, especially emotions that are high in arousal, such as anger and fear, can be better identified from spoken than from written data (e.g. Truong & Raaijmakers, 2008). ...
Article
Full-text available
Background: Identifying and addressing hotspots is a key element of imaginal exposure in Brief Eclectic Psychotherapy for PTSD (BEPP). Research shows that treatment effectiveness is associated with focusing on these hotspots and that hotspot frequency and characteristics may serve as indicators for treatment success. Objective: This study aims to develop a model to automatically recognize hotspots based on text and speech features, which might be an efficient way to track patient progress and predict treatment efficacy. Method: A multimodal supervised classification model was developed based on analog tape recordings and transcripts of imaginal exposure sessions of 10 successful and 10 non-successful treatment completers. Data mining and machine learning techniques were used to extract and select text (e.g. words and word combinations) and speech (e.g. speech rate, pauses between words) features that distinguish between ‘hotspot’ (N = 37) and ‘non-hotspot’ (N = 45) phases during exposure sessions. Results: The developed model resulted in a high training performance (mean F1-score of 0.76) but a low testing performance (mean F1-score of 0.52). This shows that the selected text and speech features could clearly distinguish between hotspots and non-hotspots in the current data set, but will probably not recognize hotspots from new input data very well. Conclusions: In order to improve the recognition of new hotspots, the described methodology should be applied to a larger, higher-quality (digitally recorded) data set. As such, this study should be seen mainly as a proof of concept, demonstrating the possible application and contribution of automatic text and audio analysis to therapy process research in PTSD and mental health research in general.
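For readers unfamiliar with the metric reported above, the F1-score is the harmonic mean of precision and recall; a small worked example with invented counts (not taken from the study) is:

```python
# Worked F1-score example; the counts are hypothetical, not from the study.
tp, fp, fn = 28, 9, 13              # true positives, false positives, false negatives
precision = tp / (tp + fp)          # 28 / 37 ≈ 0.76
recall = tp / (tp + fn)             # 28 / 41 ≈ 0.68
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")  # F1 ≈ 0.72
```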
... Some previous research has tried to use extra information sources to resolve this issue. For example, lexical features were sometimes combined with acoustic features to improve performance [17,18]. In this work, we follow this path. ...
... However, by using this method, the correlation coefficient improved by only 0.02 in the valence dimension, similar to other works, which indicates that it is of limited use for dimensional emotion recognition. [18] used pitch-, intensity-, and spectrum-related features as acoustic features, and N-grams and speech rate as lexical features. They combined the features at the feature level to train a boosting algorithm, and hypothesized that the fusion of the two information sources would improve the classification performance. ...
Article
Full-text available
Human–human interaction consists of various nonverbal behaviors that are often emotion-related. To establish rapport, it is essential that the listener respond to reactive emotion in a way that makes sense given the speaker's emotional state. However, human–robot interactions generally fail in this regard because most spoken dialogue systems play only a question-answer role. Aiming for natural conversation, we examine an emotion processing module that consists of a user emotion recognition function and a reactive emotion expression function for a spoken dialogue system to improve human–robot interaction. For the emotion recognition function, we propose a method that combines valence from prosody and sentiment from text by decision-level fusion, which considerably improves the performance. Moreover, this method reduces fatal recognition errors, thereby improving the user experience. For the reactive emotion expression function, the system's emotion is divided into emotion category and emotion level, which are predicted using the parameters estimated by the recognition function on the basis of distributions inferred from human–human dialogue data. As a result, the emotion processing module can recognize the user's emotion from his/her speech, and expresses a reactive emotion that matches. Evaluation with ten participants demonstrated that the system enhanced by this module is effective to conduct natural conversation.
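The excerpts above contrast two fusion strategies: feature-level fusion, where acoustic and lexical features are concatenated before training a single classifier (as in [18]), and decision-level fusion, where separate classifiers are trained per modality and their outputs are combined (as in the article above). The sketch below illustrates both strategies on placeholder data with scikit-learn; the feature sets, classifiers and dimensions are assumptions for illustration, not those of the cited papers.

```python
# Illustrative sketch of feature-level vs decision-level fusion
# (hypothetical data; classifiers chosen for brevity, not from the papers).
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
X_acoustic = rng.normal(size=(n, 10))   # e.g. pitch/intensity functionals
X_lexical = rng.normal(size=(n, 50))    # e.g. n-gram counts, speech rate
y = rng.integers(0, 2, size=n)          # e.g. high vs low arousal

# Feature-level fusion: concatenate modalities, train one classifier.
X_early = np.hstack([X_acoustic, X_lexical])
early = AdaBoostClassifier().fit(X_early, y)

# Decision-level fusion: train one classifier per modality,
# then combine their posterior scores (here: simple averaging).
clf_a = LogisticRegression(max_iter=1000).fit(X_acoustic, y)
clf_l = LogisticRegression(max_iter=1000).fit(X_lexical, y)
p_fused = 0.5 * (clf_a.predict_proba(X_acoustic)[:, 1]
                 + clf_l.predict_proba(X_lexical)[:, 1])
y_late = (p_fused > 0.5).astype(int)
```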
... Cowie and colleagues used valence-arousal space to model and assess affect from speech [3]. Similar work was presented in [18], where the authors developed acoustic and lexical classifiers to assess the separability of arousal and valence dimensions in natural speech. Later, in [6], the authors evaluated appraisal-based theory to judge emotional effects on vocal expression. ...
Article
Speech emotion recognition is a challenging research problem of significant scientific interest. There has been a lot of research and development around this field in recent times. In this article, we present a study which aims to improve the recognition accuracy of speech emotion recognition using a hierarchical method based on Gaussian Mixture Models and Support Vector Machines for dimensional and continuous prediction of emotions in valence (positive vs negative emotion) and arousal (the degree of emotional intensity) space. According to these dimensions, emotions are categorized into N broad groups. These N groups are further classified into other groups using spectral representation. We verify and compare the functionality of the different proposed multi-level models in order to study the differential effects of emotional valence and arousal on the recognition of a basic emotion. Experimental studies are performed on the Berlin Emotional database and the Surrey Audio-Visual Expressed Emotion corpus, which express different emotions in German and English.
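As a rough illustration of the hierarchical idea described above (a coarse grouping in valence-arousal space followed by finer classification), the sketch below uses a Gaussian Mixture Model as the first stage and per-group SVMs as the second stage. The data, group count and stage definitions are hypothetical and only loosely follow the abstract.

```python
# Rough sketch of a two-stage (hierarchical) scheme: a coarse first stage
# groups samples, and a per-group SVM then separates the emotion categories.
# Data and stage definitions are hypothetical, not from the cited study.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 13))             # e.g. spectral features per utterance
y_emotion = rng.integers(0, 4, size=300)   # e.g. angry/happy/sad/neutral

# Stage 1: unsupervised grouping into N broad clusters (stand-in for the
# valence/arousal grouping described in the abstract).
stage1 = GaussianMixture(n_components=2, random_state=0).fit(X)
groups = stage1.predict(X)

# Stage 2: one SVM per group refines the decision within that group.
stage2 = {g: SVC().fit(X[groups == g], y_emotion[groups == g])
          for g in np.unique(groups)}

def predict(x):
    g = stage1.predict(x.reshape(1, -1))[0]
    return stage2[g].predict(x.reshape(1, -1))[0]
```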
... For an overview that includes visual and physiological cues, the reader is referred to Gunes et al. (2011) and Nicolaou et al. (2011). In previous research, the 2 dimensions arousal and valence were usually discretized (see e.g., Truong and Raaijmakers, 2008) or used to divide the 2-dimensional space into 4 quadrants of Positive-Active, Positive-Passive, Negative-Active and Negative-Passive emotions. Tato et al. (2002) mapped emotion categories such as angry, happy, neutral, sadness and boredom onto three discrete levels of arousal. ...
... As SVMs (and SVRs) do not naturally take raw text (words) as input, we used lexical features that are based on a continuous representation of the textual input (similar to Truong and Raaijmakers, 2008). The textual input in our case is a manual word-level transcription made by the author herself (but could eventually be made by an ASR system). ...
... (which is in line with Mower et al., 2009). Secondly, the arousal dimension is better modeled by acoustic features, while the valence dimension is better modeled by textual features, see Table 11 re-confirming Grimm et al. (2007a) and Truong and Raaijmakers (2008) (a feature analysis of the acoustic and lexical features is out of scope for the current paper, however, Truong and Raaijmakers, 2008, provide a small feature analysis of the lexical features used although performed with a different learning algorithm). Finally, we note that in general, performance is relatively low, but that the majority of recognizers perform better than the baseline. ...
Article
The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings and look at how differences between these two types of ratings affect the development and performance of automatic emotion recognizers developed with these ratings. A dimensional approach to emotion modeling is adopted: the ratings are based on continuous arousal and valence scales. We describe the TNO-Gaming Corpus that contains spontaneous vocal and facial expressions elicited via a multiplayer videogame and that includes emotion annotations obtained via self-report and observation by outside observers. Comparisons show that there are discrepancies between self-reported and observed emotion ratings which are also reflected in the performance of the emotion recognizers developed. Using Support Vector Regression in combination with acoustic and textual features, recognizers of arousal and valence are developed that can predict points in a 2-dimensional arousal-valence space. The results of these recognizers show that the self-reported emotion is much harder to recognize than the observed emotion, and that averaging ratings from multiple observers improves performance.
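The abstract above predicts points in a continuous arousal-valence space with Support Vector Regression; a minimal sketch of that setup, assuming placeholder features and ratings, is:

```python
# Minimal sketch of predicting continuous arousal and valence with
# Support Vector Regression; features and ratings are placeholders.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 30))           # acoustic + textual features per segment
arousal = rng.uniform(-1, 1, size=150)   # continuous ratings in [-1, 1]
valence = rng.uniform(-1, 1, size=150)

# One regressor per dimension; a prediction is a point (arousal, valence).
svr_arousal = SVR().fit(X, arousal)
svr_valence = SVR().fit(X, valence)
point = (svr_arousal.predict(X[:1])[0], svr_valence.predict(X[:1])[0])
```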
... One popular boosting algorithm that has been the topic of much research is AdaBoost (Schapire and Singer, 2000). BoosTexter has been used extensively in work on conversation or dialogue analysis, polarity detection and speech processing (see for example Raaijmakers et al. 2008; Wilson et al. 2005a; Truong and Raaijmakers 2008). Specifically, for the task of classification with social tags, it was successfully evaluated in the music domain. ...
Thesis
This thesis describes work on using content to improve recommendation systems. Personalised recommendations help potential buyers filter information and identify products that they might be interested in. Current recommender systems are based mainly on collaborative filtering (CF) methods, which suffer from two main problems: (1) the ramp-up problem, where items that do not have a sufficient amount of meta-data associated with them cannot be recommended; and (2) lack of transparency due to the fact that recommendations produced by the system are not clearly explained. In this thesis we tackle both of these problems. We outline a framework for generating more accurate recommendations that are based solely on available textual content or in combination with rating information. In particular, we show how content in the form of social tags can help improve recommendations in the book and movie domains. We address the ramp-up problem and show how in cases where they do not exist, social tags can be automatically predicted from available textual content, such as the full texts of books. We evaluate our methods using two sets of data that differ in product type and size. Finally we show how once products are selected to be recommended, social tags can be used to explain the recommendations. We conduct a web-based study to evaluate different styles of explanations and demonstrate how tag-based explanations outperform a common CF-based explanation and how a textual review-like explanation yields the best results in helping users predict how much they will like the recommended items.
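BoosTexter, mentioned in the excerpt above, applies boosting to simple text-based weak learners. A loose scikit-learn approximation (not the original BoosTexter implementation, and with toy data) could look like:

```python
# Loose approximation of a BoosTexter-style setup: boosting shallow
# learners over n-gram text features. Uses scikit-learn components and
# toy data, not the original BoosTexter implementation.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

texts = ["great book, loved it", "boring and slow", "wonderful story", "not my taste"]
labels = [1, 0, 1, 0]  # e.g. a binary social-tag decision

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),   # unigram + bigram features
    AdaBoostClassifier(n_estimators=50),   # boosted decision stumps
)
model.fit(texts, labels)
print(model.predict(["slow and boring story"]))
```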
... whereas in the case of Emotional Prosody, they had to choose between five options (neutral, angry, happy, sad, and unspecified emotion). Second, some of the target emotions are acoustically quite related: angry and happy speech, for example, are both characterised by increased levels of pitch, intensity, and speech rate [28, 29] and show a very similar pattern in terms of spectral properties [30]. Third, listeners may have mutually divergent internal representations of the specific emotions. ...
... Secondly, the deviant results for Emotional Prosody compel us to rethink the perceptual judgement procedure. A new line of thought could be to judge emotions along a broader dimension of arousal and classify utterances as active or passive [28, 29]. In addition, a break-down analysis per emotion will be carried out to identify possible differences between emotions. ...
Article
Full-text available
This study examines the impact of Parkinson's disease (PD) on communicative efficiency conveyed through prosody. A new assessment method for evaluating productive prosodic skills in Dutch speaking dysarthric patients was devised and tested on 36 individuals (18 controls, 18 PD patients). Three professional listeners judged the intended meanings in four communicative functions of Dutch prosody: Boundary Marking, Focus, Sentence Typing, and Emotional Prosody. Each function was tested through reading and imitation. Interrater agreement was calculated. Results indicated that healthy speakers, compared to PD patients, performed significantly better on imitation of Boundary Marking, Focus, and Sentence Typing. PD patients with a moderate or severe dysarthria performed significantly worse on imitation of Focus than on reading of Focus. No significant differences were found for Emotional Prosody. Judges agreed well on all tasks except Emotional Prosody. Future research will focus on elaborating the assessment and on developing a therapy programme paralleling the assessment.
... In contrast, the approach yields 65.07% using the dataset with only linguistic features. In summary, the fusion yields an absolute classification improvement of 3.51%. Truong and Raaijmakers describe an approach to automatic recognition of spontaneous emotions that relies on the acoustic and the lexical modalities ([Truong & Raaijmakers, 2008]). It uses acoustic features (mean, standard deviation, max-min, the averaged slope of pitch and intensity) and lexical features (N-grams and speech rate). ...
Thesis
Full-text available
This dissertation investigates opinion mining and lexical affect sensing. It discusses emotional corpora and describes different approaches to affect categorization of their texts: a statistical approach that utilizes lexical, deictic, stylometric, and grammatical information; a semantic approach that relies on emotional dictionaries and on deep grammatical analysis; and a hybrid approach that combines the statistical approach and the semantic approach. Furthermore, this thesis explores affect sensing using multimodal fusion. In conclusion, the thesis discusses significant contributions and describes future work. Review by Dr. Marina Santini -- http://www.forum.santini.se/2012/11/thesis-review-opinion-mining-and-lexical-affect-sensing/.
... In sports events, the experiencer is often present in the video in the form of the audience, simplifying the task [10]. Valence and arousal expressed in the speech of video game players has also been studied [13]. Here, the experiencer is the game player and the evoker is the game and the other game participants. ...
... Methods for affective modeling use audio alone (e.g., [17]) or audio and video features (e.g., [10]). Closest to our own work are methods combining acoustic and lexical features, whereby acoustic features are more effective for predicting arousal and lexical features for valence [13]. ...
Conference Paper
Full-text available
We carry out two studies on affective state modeling for communication settings that involve unilateral intent on the part of one participant (the evoker) to shift the affective state of another participant (the experiencer). The first investigates viewer response in a narrative setting using a corpus of documentaries annotated with viewer-reported narrative peaks. The second investigates affective triggers in a conversational setting using a corpus of recorded interactions, annotated with continuous affective ratings, between a human interlocutor and an emotionally colored agent. In each case, we build a "one-sided" model using indicators derived from the speech of one participant. Our classification experiments confirm the viability of our models and provide insight into useful features.
... In this experiment we test the ability of models trained on one database to generalize to another one. We use a prosodic, utterance-level feature set inspired by the minimum required set of features proposed by [11] and the approach of [20]. The feature set contains: pitch (mean, standard deviation, range, absolute slope (without octave jumps), jitter), intensity (mean, standard deviation, range, absolute slope, shimmer), means of the first 4 formants, long-term averaged spectrum (slope, Hammarberg index, high energy) and center of gravity and skewness of the spectrum. ...
Conference Paper
Full-text available
We explore possibilities for enhancing the generality, portability and robustness of emotion recognition systems by combining databases and by fusion of classifiers. In a first experiment, we investigate the performance of an emotion detection system tested on a certain database given that it is trained on speech from either the same database, a different database or a mix of both. We observe that generally there is a drop in performance when the test database does not match the training material, but there are a few exceptions. Furthermore, the performance drops when a mixed corpus of acted databases is used for training and testing is carried out on real-life recordings. In a second experiment we investigate the effect of training multiple emotion detectors, and fusing these into a single detection system. We observe a drop in the Equal Error Rate (EER) from 19.0% on average for 4 individual detectors to 4.2% when fused using FoCal [1].
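The Equal Error Rate is the operating point where the false-positive and false-negative rates coincide, and the FoCal toolkit performs score fusion via linear logistic regression. The sketch below, with synthetic detector scores, shows how an EER can be computed and how scikit-learn's LogisticRegression can stand in for FoCal-style fusion; it is an illustration under these assumptions, not the authors' setup.

```python
# Hypothetical illustration of score-level fusion and EER computation.
# FoCal itself does linear logistic regression fusion/calibration; here
# scikit-learn's LogisticRegression stands in for it, on made-up scores.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))    # point where FPR ≈ FNR
    return (fpr[idx] + fnr[idx]) / 2

rng = np.random.default_rng(3)
labels = rng.integers(0, 2, size=500)
# Four imperfect detectors: scores correlate with the label plus noise.
scores = np.stack([labels + rng.normal(scale=s, size=500)
                   for s in (0.8, 0.9, 1.0, 1.1)], axis=1)

for d in range(4):
    print(f"detector {d}: EER = {equal_error_rate(labels, scores[:, d]):.3f}")

# Fusion trained and evaluated on the same synthetic scores, for brevity.
fusion = LogisticRegression().fit(scores, labels)
fused = fusion.decision_function(scores)
print(f"fused: EER = {equal_error_rate(labels, fused):.3f}")
```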
... The TNO-GAMING corpus (see also [4, 3, 5]) contains audiovisual recordings of expressive behavior of subjects (17m/11f) playing a video game (Unreal Tournament). Speech recordings were made with high-quality close-talk microphones. ...
Conference Paper
Full-text available
In this paper, we describe emotion recognition experiments car- ried out for spontaneous affective speech with the aim to com- pare the added value of annotation of felt emotion versus an- notation of perceived emotion. Using speech material avail- able in the TNO-GAMING corpus (a corpus containing audio- visual recordings of people playing videogames), speech-based affect recognizers were developed that can predict Arousal and Valence scalar values. Two types of recognizers were devel- oped in parallel: one trained with felt emotion annotations (generated by the gamers themselves) and one trained with perceived/observed emotion annotations (generated by a group of observers). The experiments showed that, in speech, with the methods and features currently used, observed emotions are easier to predict than felt emotions. The results sugges t that recognition performance strongly depends on how and by whom the emotion annotations are carried out. Index Terms: emotion, emotional speech database, emotion recognition