Fig 2. Frequency of emotion category labels: the grey areas indicate the number of speech segments that could be associated with an emotion 'click'.

Source publication
Conference Paper
Full-text available
We developed acoustic and lexical classifiers, based on a boosting algorithm, to assess the separability on arousal and valence dimensions in spontaneous emotional speech. The spontaneous emotional speech data was acquired by inviting subjects to play a first-person shooter video game. Our acoustic classifiers performed significantly better than th...

Citations

... Some emotions, especially emotions that are high in arousal, such as anger and fear, can be better identified from spoken than from written data (e.g. Truong & Raaijmakers, 2008). ...
Article
Full-text available
Background: Identifying and addressing hotspots is a key element of imaginal exposure in Brief Eclectic Psychotherapy for PTSD (BEPP). Research shows that treatment effectiveness is associated with focusing on these hotspots and that hotspot frequency and characteristics may serve as indicators for treatment success. Objective: This study aims to develop a model to automatically recognize hotspots based on text and speech features, which might be an efficient way to track patient progress and predict treatment efficacy. Method: A multimodal supervised classification model was developed based on analog tape recordings and transcripts of imaginal exposure sessions of 10 successful and 10 non-successful treatment completers. Data mining and machine learning techniques were used to extract and select text (e.g. words and word combinations) and speech (e.g. speech rate, pauses between words) features that distinguish between ‘hotspot’ (N = 37) and ‘non-hotspot’ (N = 45) phases during exposure sessions. Results: The developed model resulted in a high training performance (mean F1-score of 0.76) but a low testing performance (mean F1-score of 0.52). This shows that the selected text and speech features could clearly distinguish between hotspots and non-hotspots in the current data set, but will probably not recognize hotspots from new input data very well. Conclusions: In order to improve the recognition of new hotspots, the described methodology should be applied to a larger, higher-quality (digitally recorded) data set. As such, this study should be seen mainly as a proof of concept, demonstrating the possible application and contribution of automatic text and audio analysis to therapy process research in PTSD and mental health research in general.
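For readers unfamiliar with the metric reported above, the F1-score is the harmonic mean of precision and recall; a small worked example with invented counts (not taken from the study) is:

```python
# Worked F1-score example; the counts are hypothetical, not from the study.
tp, fp, fn = 28, 9, 13              # true positives, false positives, false negatives
precision = tp / (tp + fp)          # 28 / 37 ≈ 0.76
recall = tp / (tp + fn)             # 28 / 41 ≈ 0.68
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")  # F1 ≈ 0.72
```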
... Some previous research has tried to use extra information sources to resolve this issue. For example, lexical features were sometimes combined with acoustic features to improve performance [17,18]. In this work, we follow this path. ...
... However, by using this method, the correlation coefficient improved by only 0.02 in the valence dimension, similar to other works, which indicates that it is of limited use for dimensional emotion recognition. [18] used pitch-, intensity-, and spectrum-related features as acoustic features, and N-grams and speech rate as lexical features. They combined the features at the feature level to train a boosting algorithm, and hypothesized that the fusion of the two information sources would improve the classification performance. ...
Article
Full-text available
Human–human interaction consists of various nonverbal behaviors that are often emotion-related. To establish rapport, it is essential that the listener respond to reactive emotion in a way that makes sense given the speaker's emotional state. However, human–robot interactions generally fail in this regard because most spoken dialogue systems play only a question-answer role. Aiming for natural conversation, we examine an emotion processing module that consists of a user emotion recognition function and a reactive emotion expression function for a spoken dialogue system to improve human–robot interaction. For the emotion recognition function, we propose a method that combines valence from prosody and sentiment from text by decision-level fusion, which considerably improves the performance. Moreover, this method reduces fatal recognition errors, thereby improving the user experience. For the reactive emotion expression function, the system's emotion is divided into emotion category and emotion level, which are predicted using the parameters estimated by the recognition function on the basis of distributions inferred from human–human dialogue data. As a result, the emotion processing module can recognize the user's emotion from his/her speech, and expresses a reactive emotion that matches. Evaluation with ten participants demonstrated that the system enhanced by this module is effective to conduct natural conversation.
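The excerpts above contrast two fusion strategies: feature-level fusion, where acoustic and lexical features are concatenated before training a single classifier (as in [18]), and decision-level fusion, where separate classifiers are trained per modality and their outputs are combined (as in the article above). The sketch below illustrates both strategies on placeholder data with scikit-learn; the feature sets, classifiers and dimensions are assumptions for illustration, not those of the cited papers.

```python
# Illustrative sketch of feature-level vs decision-level fusion
# (hypothetical data; classifiers chosen for brevity, not from the papers).
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
X_acoustic = rng.normal(size=(n, 10))   # e.g. pitch/intensity functionals
X_lexical = rng.normal(size=(n, 50))    # e.g. n-gram counts, speech rate
y = rng.integers(0, 2, size=n)          # e.g. high vs low arousal

# Feature-level fusion: concatenate modalities, train one classifier.
X_early = np.hstack([X_acoustic, X_lexical])
early = AdaBoostClassifier().fit(X_early, y)

# Decision-level fusion: train one classifier per modality,
# then combine their posterior scores (here: simple averaging).
clf_a = LogisticRegression(max_iter=1000).fit(X_acoustic, y)
clf_l = LogisticRegression(max_iter=1000).fit(X_lexical, y)
p_fused = 0.5 * (clf_a.predict_proba(X_acoustic)[:, 1]
                 + clf_l.predict_proba(X_lexical)[:, 1])
y_late = (p_fused > 0.5).astype(int)
```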
... Cowie and colleagues used valence-arousal space to model and assess affect from speech [3]. Similar work was presented in [18], where the authors developed acoustic and lexical classifiers to assess the separability of arousal and valence dimensions in natural speech. Later, in [6], the authors evaluated appraisal-based theory to judge emotional effects on vocal expression. ...
Article
Speech emotion recognition is a challenging research problem of significant scientific interest. There has been a lot of research and development around this field in recent times. In this article, we present a study which aims to improve the recognition accuracy of speech emotion recognition using a hierarchical method based on Gaussian Mixture Models and Support Vector Machines for dimensional and continuous prediction of emotions in valence (positive vs negative emotion) and arousal (the degree of emotional intensity) space. According to these dimensions, emotions are categorized into N broad groups. These N groups are further classified into other groups using spectral representation. We verify and compare the functionality of the different proposed multi-level models in order to study the differential effects of emotional valence and arousal on the recognition of a basic emotion. Experimental studies are performed on the Berlin Emotional database and the Surrey Audio-Visual Expressed Emotion corpus, which express different emotions in German and English.
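As a rough illustration of the hierarchical idea described above (a coarse grouping in valence-arousal space followed by finer classification), the sketch below uses a Gaussian Mixture Model as the first stage and per-group SVMs as the second stage. The data, group count and stage definitions are hypothetical and only loosely follow the abstract.

```python
# Rough sketch of a two-stage (hierarchical) scheme: a coarse first stage
# groups samples, and a per-group SVM then separates the emotion categories.
# Data and stage definitions are hypothetical, not from the cited study.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 13))             # e.g. spectral features per utterance
y_emotion = rng.integers(0, 4, size=300)   # e.g. angry/happy/sad/neutral

# Stage 1: unsupervised grouping into N broad clusters (stand-in for the
# valence/arousal grouping described in the abstract).
stage1 = GaussianMixture(n_components=2, random_state=0).fit(X)
groups = stage1.predict(X)

# Stage 2: one SVM per group refines the decision within that group.
stage2 = {g: SVC().fit(X[groups == g], y_emotion[groups == g])
          for g in np.unique(groups)}

def predict(x):
    g = stage1.predict(x.reshape(1, -1))[0]
    return stage2[g].predict(x.reshape(1, -1))[0]
```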
... For an overview that includes visual and physiological cues, the reader is referred to Gunes et al. (2011) and Nicolaou et al. (2011). In previous research, the 2 dimensions arousal and valence were usually discretized (see e.g., Truong and Raaijmakers, 2008) or used to divide the 2-dimensional space into 4 quadrants of Positive-Active, Positive-Passive, Negative-Active and Negative-Passive emotions. Tato et al. (2002) mapped emotion categories such as angry, happy, neutral, sadness and boredom onto three discrete levels of arousal. ...
... As SVMs (and SVRs) do not naturally take raw text (words) as input, we used lexical features that are based on a continuous representation of the textual input (similar to Truong and Raaijmakers, 2008). The textual input in our case is a manual word-level transcription made by the author herself (but could eventually be made by an ASR system). ...
... (which is in line with Mower et al., 2009). Secondly, the arousal dimension is better modeled by acoustic features, while the valence dimension is better modeled by textual features, see Table 11 re-confirming Grimm et al. (2007a) and Truong and Raaijmakers (2008) (a feature analysis of the acoustic and lexical features is out of scope for the current paper, however, Truong and Raaijmakers, 2008, provide a small feature analysis of the lexical features used although performed with a different learning algorithm). Finally, we note that in general, performance is relatively low, but that the majority of recognizers perform better than the baseline. ...
Article
The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings and look at how differences between these two types of ratings affect the development and performance of automatic emotion recognizers developed with these ratings. A dimensional approach to emotion modeling is adopted: the ratings are based on continuous arousal and valence scales. We describe the TNO-Gaming Corpus that contains spontaneous vocal and facial expressions elicited via a multiplayer videogame and that includes emotion annotations obtained via self-report and observation by outside observers. Comparisons show that there are discrepancies between self-reported and observed emotion ratings which are also reflected in the performance of the emotion recognizers developed. Using Support Vector Regression in combination with acoustic and textual features, recognizers of arousal and valence are developed that can predict points in a 2-dimensional arousal-valence space. The results of these recognizers show that the self-reported emotion is much harder to recognize than the observed emotion, and that averaging ratings from multiple observers improves performance.
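The abstract above predicts points in a continuous arousal-valence space with Support Vector Regression; a minimal sketch of that setup, assuming placeholder features and ratings, is:

```python
# Minimal sketch of predicting continuous arousal and valence with
# Support Vector Regression; features and ratings are placeholders.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 30))           # acoustic + textual features per segment
arousal = rng.uniform(-1, 1, size=150)   # continuous ratings in [-1, 1]
valence = rng.uniform(-1, 1, size=150)

# One regressor per dimension; a prediction is a point (arousal, valence).
svr_arousal = SVR().fit(X, arousal)
svr_valence = SVR().fit(X, valence)
point = (svr_arousal.predict(X[:1])[0], svr_valence.predict(X[:1])[0])
```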
... One popular boosting algorithm that has been the topic of much research is AdaBoost (Schapire and Singer, 2000). BoosTexter has been used extensively in work on conversation or dialogue analysis, polarity detection and speech processing (see for example Raaijmakers et al. 2008; Wilson et al. 2005a; Truong and Raaijmakers 2008). Specifically, for the task of classification with social tags, it was successfully evaluated in the music domain. ...
Thesis
This thesis describes work on using content to improve recommendation systems. Personalised recommendations help potential buyers filter information and identify products that they might be interested in. Current recommender systems are based mainly on collaborative filtering (CF) methods, which suffer from two main problems: (1) the ramp-up problem, where items that do not have a sufficient amount of meta-data associated with them cannot be recommended; and (2) lack of transparency due to the fact that recommendations produced by the system are not clearly explained. In this thesis we tackle both of these problems. We outline a framework for generating more accurate recommendations that are based solely on available textual content or in combination with rating information. In particular, we show how content in the form of social tags can help improve recommendations in the book and movie domains. We address the ramp-up problem and show how in cases where they do not exist, social tags can be automatically predicted from available textual content, such as the full texts of books. We evaluate our methods using two sets of data that differ in product type and size. Finally we show how once products are selected to be recommended, social tags can be used to explain the recommendations. We conduct a web-based study to evaluate different styles of explanations and demonstrate how tag-based explanations outperform a common CF-based explanation and how a textual review-like explanation yields the best results in helping users predict how much they will like the recommended items.
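BoosTexter, mentioned in the excerpt above, applies boosting to simple text-based weak learners. A loose scikit-learn approximation (not the original BoosTexter implementation, and with toy data) could look like:

```python
# Loose approximation of a BoosTexter-style setup: boosting shallow
# learners over n-gram text features. Uses scikit-learn components and
# toy data, not the original BoosTexter implementation.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

texts = ["great book, loved it", "boring and slow", "wonderful story", "not my taste"]
labels = [1, 0, 1, 0]  # e.g. a binary social-tag decision

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),   # unigram + bigram features
    AdaBoostClassifier(n_estimators=50),   # boosted decision stumps
)
model.fit(texts, labels)
print(model.predict(["slow and boring story"]))
```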
... whereas in the case of Emotional Prosody, they had to choose between five options (neutral, angry, happy, sad, and unspecified emotion). Second, some of the target emotions are acoustically quite related: angry and happy speech, for example, are both characterised by increased levels of pitch, intensity, and speech rate [28, 29] and show a very similar pattern in terms of spectral properties [30]. Third, listeners may have mutually divergent internal representations of the specific emotions. ...
... Secondly, the deviant results for Emotional Prosody compel us to rethink the perceptual judgement procedure. A new line of thought could be to judge emotions along a broader dimension of arousal and classify utterances as active or passive [28, 29]. In addition, a break-down analysis per emotion will be carried out to identify possible differences between emotions. ...
Article
Full-text available
This study examines the impact of Parkinson's disease (PD) on communicative efficiency conveyed through prosody. A new assessment method for evaluating productive prosodic skills in Dutch speaking dysarthric patients was devised and tested on 36 individuals (18 controls, 18 PD patients). Three professional listeners judged the intended meanings in four communicative functions of Dutch prosody: Boundary Marking, Focus, Sentence Typing, and Emotional Prosody. Each function was tested through reading and imitation. Interrater agreement was calculated. Results indicated that healthy speakers, compared to PD patients, performed significantly better on imitation of Boundary Marking, Focus, and Sentence Typing. PD patients with a moderate or severe dysarthria performed significantly worse on imitation of Focus than on reading of Focus. No significant differences were found for Emotional Prosody. Judges agreed well on all tasks except Emotional Prosody. Future research will focus on elaborating the assessment and on developing a therapy programme paralleling the assessment.
... In contrast, the approach yields 65.07% using the dataset with only linguistic features. In summary, the fusion yields an absolute classification improvement of 3.51%. Truong and Raaijmakers describe an approach to automatic recognition of spontaneous emotions that relies on the acoustic and the lexical modalities ([Truong & Raaijmakers, 2008]). It uses acoustic features (mean, standard deviation, max-min, the averaged slope of pitch and intensity) and lexical features (N-grams and speech rate). ...
Thesis
Full-text available
This dissertation investigates opinion mining and lexical affect sensing. It discusses emotional corpora and describes different approaches to affect categorization of their texts: a statistical approach that utilizes lexical, deictic, stylometric, and grammatical information; a semantic approach that relies on emotional dictionaries and on deep grammatical analysis; and a hybrid approach that combines the statistical approach and the semantic approach. Furthermore, this thesis explores affect sensing using multimodal fusion. In conclusion, the thesis discusses significant contributions and describes future work. Review by Dr. Marina Santini -- http://www.forum.santini.se/2012/11/thesis-review-opinion-mining-and-lexical-affect-sensing/.
... In sports events, the experiencer is often present in the video in the form of the audience, simplifying the task [10]. Valence and arousal expressed in the speech of video game players has also been studied [13]. Here, the experiencer is the game player and the evoker is the game and the other game participants. ...
... Methods for affective modeling use audio alone (e.g., [17]) or audio and video features (e.g., [10]). Closest to our own work are methods combining acoustic and lexical features, whereby acoustic features are more effective for predicting arousal and lexical features for valence [13]. ...
Conference Paper
Full-text available
We carry out two studies on affective state modeling for communication settings that involve unilateral intent on the part of one participant (the evoker) to shift the affective state of another participant (the experiencer). The first investigates viewer response in a narrative setting using a corpus of documentaries annotated with viewer-reported narrative peaks. The second investigates affective triggers in a conversational setting using a corpus of recorded interactions, annotated with continuous affective ratings, between a human interlocutor and an emotionally colored agent. In each case, we build a "one-sided" model using indicators derived from the speech of one participant. Our classification experiments confirm the viability of our models and provide insight into useful features.
... In this experiment we test the ability of models trained on one database to generalize to another one. We use a prosodic, utterance-level feature set inspired by the minimum required set of features proposed by [11] and the approach of [20]. The feature set contains: pitch (mean, standard deviation, range, absolute slope (without octave jumps), jitter), intensity (mean, standard deviation, range, absolute slope, shimmer), means of the first 4 formants, long-term averaged spectrum (slope, Hammarberg index, high energy) and center of gravity and skewness of the spectrum. ...
Conference Paper
Full-text available
We explore possibilities for enhancing the generality, portability and robustness of emotion recognition systems by combining databases and by fusion of classifiers. In a first experiment, we investigate the performance of an emotion detection system tested on a certain database given that it is trained on speech from either the same database, a different database or a mix of both. We observe that generally there is a drop in performance when the test database does not match the training material, but there are a few exceptions. Furthermore, the performance drops when a mixed corpus of acted databases is used for training and testing is carried out on real-life recordings. In a second experiment we investigate the effect of training multiple emotion detectors, and fusing these into a single detection system. We observe a drop in the Equal Error Rate (EER) from 19.0% on average for 4 individual detectors to 4.2% when fused using FoCal [1].
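The Equal Error Rate is the operating point where the false-positive and false-negative rates coincide, and the FoCal toolkit performs score fusion via linear logistic regression. The sketch below, with synthetic detector scores, shows how an EER can be computed and how scikit-learn's LogisticRegression can stand in for FoCal-style fusion; it is an illustration under these assumptions, not the authors' setup.

```python
# Hypothetical illustration of score-level fusion and EER computation.
# FoCal itself does linear logistic regression fusion/calibration; here
# scikit-learn's LogisticRegression stands in for it, on made-up scores.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))    # point where FPR ≈ FNR
    return (fpr[idx] + fnr[idx]) / 2

rng = np.random.default_rng(3)
labels = rng.integers(0, 2, size=500)
# Four imperfect detectors: scores correlate with the label plus noise.
scores = np.stack([labels + rng.normal(scale=s, size=500)
                   for s in (0.8, 0.9, 1.0, 1.1)], axis=1)

for d in range(4):
    print(f"detector {d}: EER = {equal_error_rate(labels, scores[:, d]):.3f}")

# Fusion trained and evaluated on the same synthetic scores, for brevity.
fusion = LogisticRegression().fit(scores, labels)
fused = fusion.decision_function(scores)
print(f"fused: EER = {equal_error_rate(labels, fused):.3f}")
```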
... The TNO-GAMING corpus (see also [4, 3, 5]) contains audiovisual recordings of expressive behavior of subjects (17m/11f) playing a video game (Unreal Tournament). Speech recordings were made with high-quality close-talk microphones. ...
Conference Paper
Full-text available
In this paper, we describe emotion recognition experiments car- ried out for spontaneous affective speech with the aim to com- pare the added value of annotation of felt emotion versus an- notation of perceived emotion. Using speech material avail- able in the TNO-GAMING corpus (a corpus containing audio- visual recordings of people playing videogames), speech-based affect recognizers were developed that can predict Arousal and Valence scalar values. Two types of recognizers were devel- oped in parallel: one trained with felt emotion annotations (generated by the gamers themselves) and one trained with perceived/observed emotion annotations (generated by a group of observers). The experiments showed that, in speech, with the methods and features currently used, observed emotions are easier to predict than felt emotions. The results sugges t that recognition performance strongly depends on how and by whom the emotion annotations are carried out. Index Terms: emotion, emotional speech database, emotion recognition