Article · PDF available

Generation of emotions by a morphing technique in English, French and Spanish

Authors:
  • Boula de Mareüil, Célérier and Toen (Interdisciplinary Laboratory of Digital Sciences)

Abstract

Generating variants has become a priority for text-to-speech (TTS) synthesis. In particular, additional mark-ups inserted within the text may be used to communicate emotions. Within the framework of a European project linked to the MPEG-4 standard (INTERFACE), our purpose is the synthesis of six emotions (anger, disgust, fear, joy, surprise and sadness). This was performed by applying a morphing technique to the sequence of phonemes and their corresponding prosodic characteristics, generated for a "neutral" style by a multilingual TTS system. We have at our disposal corpora of these six emotions recorded by professional actors in English, French and Spanish. Some trends may be drawn from them, such as the inversion of fundamental frequency slopes for disgust and the pruning of melodic movements for sadness. We are inclined to think that the perceptual identification of the different emotions will be facilitated, within the framework of MPEG-4, by the addition of a visual component: a talking head.
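To make the morphing idea concrete, here is a minimal Python sketch that interpolates a neutral F0 contour toward emotion-specific targets. The slope-inversion and contour-flattening rules are loose stand-ins inspired by the trends reported above for disgust and sadness; they are not the paper's actual algorithm, and `morph_f0` is a hypothetical helper.

```python
import numpy as np

def morph_f0(neutral_f0, emotion, alpha=1.0):
    """Morph a neutral F0 contour (Hz, one value per frame) toward a
    crude emotion-specific target; alpha = 0 keeps the neutral contour,
    alpha = 1 applies the full transform."""
    f0 = np.asarray(neutral_f0, dtype=float)
    mean = f0.mean()
    if emotion == "disgust":
        # Invert F0 slopes by mirroring the contour around its mean.
        target = 2 * mean - f0
    elif emotion == "sadness":
        # Prune melodic movements by compressing excursions from the mean.
        target = mean + 0.3 * (f0 - mean)
    else:
        target = f0
    # Linear morph between the neutral and target contours.
    return (1 - alpha) * f0 + alpha * target
```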
... Language-specific effects might also be expected (see Section II). For example, whispery voice has been mentioned as being associated with fear in English (Boula de Mareüil et al., 2002), but with different qualities in other languages; breathy voice is traditionally associated with intimacy in English (Laver, 1980) but with formality/politeness in Japanese (Ito, 2004; Ishi et al., 2008). Creaky voice also tends to be associated with different affective states in different languages. ...
Article
Full-text available
The relationship between prosody and perceived affect involves multiple variables. This paper explores the interplay of three: voice quality, f0 contour, and the hearer's language background. Perception tests were conducted with speakers of Irish English, Russian, Spanish, and Japanese using three types of synthetic stimuli: (1) stimuli varied in voice quality, (2) stimuli of uniform (modal) voice quality incorporating affect-related f0 contours, and (3) stimuli combining specific non-modal voice qualities with the affect-related f0 contours of (2). The participants rated the stimuli for the presence/strength of affective colouring on six bipolar scales, e.g., happy-sad. The results suggest that stimuli incorporating non-modal voice qualities, with or without f0 variation, are generally more effective in affect cueing than stimuli varying only in f0. Along with similarities in the affective responses across these languages, many points of divergence were found, both in terms of the range and strength of affective responses overall and in terms of specific stimulus-to-affect associations. The f0 contour may play a more important role, and tense voice a lesser role in affect signalling in Japanese and Spanish than in Irish English and Russian. The greatest cross-language differences emerged for the affects intimate, formal, stressed, and relaxed.
... Fox (2001) notes, in shared-reading situations, emphatic stresses on the syllables or words to which the teacher wants to draw attention. Boiron (2004) observes that reading aloud to pupils is already a first interpretation and allows certain elements to be singled out: she speaks of an "interpretative orientation". ...
... One such technique, concatenative synthesis, automatically recombines large numbers of speech samples so that the resulting sequence matches a target sentence and the resulting sounds match the intended emotion. The emotional content of the concatenated sequence may come from the original speaking style of the pre-recorded samples ("select from the sad corpus") (Eide et al., 2004), result from the algorithmic transformation of neutral samples (Bulut et al., 2005), or come from hybrid approaches that morph between different emotional samples (Boula de Mareüil et al., 2002). Another transformation approach to emotional speech synthesis is the recent trend of "voice conversion" research, which tries to impersonate a target voice by modifying a source voice. ...
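The "select from the sad corpus" strategy mentioned in this excerpt can be caricatured in a few lines. The sketch below is a toy greedy unit selection over a hypothetical list of (phone, emotion, waveform) tuples; real systems add join and target costs that this deliberately omits.

```python
def select_units(target_phones, emotion, corpus):
    """Toy unit selection: for each target phone, pick the corpus unit
    with a matching phone label, preferring units recorded in the
    intended emotion. `corpus` is a list of (phone, emotion, waveform)
    tuples standing in for a real unit-selection database."""
    sequence = []
    for phone in target_phones:
        candidates = [u for u in corpus if u[0] == phone]
        if not candidates:
            raise ValueError(f"no unit found for phone {phone!r}")
        # Zero cost for units in the intended emotion, else a penalty.
        best = min(candidates, key=lambda u: 0 if u[1] == emotion else 1)
        sequence.append(best[2])  # keep only the waveform
    return sequence
```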
Article
Full-text available
We present an open-source software platform that transforms emotional cues expressed by speech signals using audio effects like pitch shifting, inflection, vibrato, and filtering. The emotional transformations can be applied to any audio file, but can also run in real time, using live input from a microphone, with less than 20-ms latency. We anticipate that this tool will be useful for the study of emotions in psychology and neuroscience, because it enables a high level of control over the acoustical and emotional content of experimental stimuli in a variety of laboratory situations, including real-time social situations. We present here results of a series of validation experiments aiming to position the tool against several methodological requirements: that transformed emotions be recognized at above-chance levels, valid in several languages (French, English, Swedish, and Japanese) and with a naturalness comparable to natural speech.
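As a hint of how one of the listed effects might be realized, the following sketch implements vibrato as a sinusoidally modulated fractional delay line, a textbook formulation with assumed rate and depth parameters; the platform described in the abstract may implement the effect differently.

```python
import numpy as np

def vibrato(signal, sr, rate_hz=6.0, depth_ms=0.5):
    """Apply vibrato to a 1-D float array by reading the signal through
    a delay line whose length oscillates sinusoidally."""
    depth = depth_ms * 1e-3 * sr                    # depth in samples
    n = np.arange(len(signal))
    delay = depth * (1 + np.sin(2 * np.pi * rate_hz * n / sr))
    read_pos = np.clip(n - delay, 0, len(signal) - 1)
    i = read_pos.astype(int)                        # linear interpolation
    frac = read_pos - i
    j = np.minimum(i + 1, len(signal) - 1)
    return (1 - frac) * signal[i] + frac * signal[j]
```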
... Cahn (1990) has experimented with an "affect editor", which uses an abstract model of emotional speech along with generation instructions to produce recognizable and sometimes even natural-sounding emotions with formant synthesis. Boula de Mareüil et al. (2002) have synthesized six emotions (anger, disgust, fear, joy, surprise and sadness) in three languages (English, French and Spanish) using corpora and morphing techniques. Organic variation is difficult to manipulate, especially in concatenation synthesis. ...
Article
Full-text available
Phonetic variation, and especially prosodic variation, which is often paralinguistic in nature, has gradually attracted more attention among speech researchers and speech scientists as one possible solution to problems with automatic speaker recognition (ASrR) and text-to-speech synthesis (TTS) systems. This paper presents a brief overview of approaches to phonetic variation in ASrR and TTS, beginning with attempts to classify linguistic and paralinguistic phenomena in speech. Some of the problems related to paralinguistic phonetic variation and attempted solutions are also discussed.
... At the same time, the use of these means, their relative importance and their meaning vary across languages (Chen, 2005; Abelin & Allwood, 2000). Studies of cross-linguistic production are still quite rare (Boula de Mareuil, Célérier, & Toen, 2002). They usually involve a very small number of speakers per language and disregard a widely observed phenomenon in affective speech: high inter-speaker and intra-speaker variability. ...
Article
Full-text available
The main objective of this research is to investigate the production of affective speech by bilingual and monolingual children cross-linguistically. Cross-linguistic differences in affective speech may lead bilingual children to perceive and to express emotions differently in their two languages. A cross-linguistically comparable corpus of 8 bilingual Scottish-French children and 16 monolingual peers (average age 8) was recorded according to the developed methodology. This chapter presents preliminary results on pitch range, peak alignment and speech rate for bilingual children and their monolingual peers, comparing their emotions and languages.
Chapter
Depression has been affecting people all around the world, including Malaysians. Early detection mechanisms are vital for assisting clinical professionals in identifying depressed patients at an early stage. Although this can be accomplished through interviews and questionnaires, these time-consuming methods have several additional disadvantages. Acoustic measurement and MFCC have notably been adapted to detect speaker emotion, and numerous researchers have applied them to various languages for prediction. Their efficiency varies across studies, although they contribute significantly to diagnosing depression. As cultural diversity appears to influence how emotion is perceived, depression detection mechanisms can vary between languages. This paper provides a comprehensive analysis based on relevant studies published from 2000 to 2023 to show the effectiveness of acoustic measurement and MFCC in depression detection. It was discovered that the Support Vector Machine (SVM) is extensively utilised and can successfully contribute to the detection of depressed patients using biometric characteristics. The outcome of this study encourages experimental investigation of the effectiveness of acoustic measurement and MFCC for depression identification among Malaysian speakers.
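A minimal sketch of the MFCC-plus-SVM pipeline this review examines, assuming librosa and scikit-learn; `wav_paths` and `labels` are placeholders for a clinically annotated corpus that the sketch does not supply.

```python
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def mfcc_features(path, n_mfcc=13):
    """Summarize an utterance by the mean and std of its MFCCs."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# wav_paths / labels (1 = depressed, 0 = control) must come from an
# annotated corpus; they are placeholders here.
X = np.array([mfcc_features(p) for p in wav_paths])
y = np.array(labels)
print(cross_val_score(SVC(kernel="rbf", C=1.0), X, y, cv=5).mean())
```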
Article
Purpose: Capturing phonation types such as breathy, modal, and pressed voices precisely can facilitate the recognition of human emotions. However, little is known about how exactly phonation types and decoders' gender influence the perception of emotional speech. Based on the modified Brunswikian lens model, this article aims to examine the roles of phonation types and decoders' gender in Mandarin emotional speech recognition by virtue of articulatory speech synthesis.

Method: Fifty-five participants (28 male and 27 female) completed a recognition task of Mandarin emotional speech, with 200 stimuli representing five emotional categories (happiness, anger, fear, sadness, and neutrality) and five types (original, copied, breathy, modal, and pressed). Repeated-measures analyses of variance were performed to analyze recognition accuracy and confusion data.

Results: For male and female decoders, the recognition accuracy of anger from pressed stimuli and fear from breathy stimuli was high; across all phonation-type stimuli, the recognition accuracy of sadness was also high, but that of happiness was low. The confusion data revealed that in recognizing fear from all phonation-type stimuli, female decoders chose fear responses more frequently and neutral responses less frequently than male decoders. In recognizing neutrality from breathy stimuli, female decoders significantly reduced their choice of neutral responses and misidentified neutrality as anger, while male decoders mistook neutrality from pressed stimuli for anger.

Conclusions: This study revealed that, in Mandarin, phonation types play crucial roles in recognizing anger, fear, and neutrality, while the recognition of sadness and happiness seems not to depend heavily on phonation types. Moreover, the decoders' gender affects their recognition of neutrality and fear. These findings support the modified Brunswikian lens model and have significance for diagnosis and intervention among clinical populations with hearing impairment or gender-related psychiatric disorders.

Supplemental Material: https://doi.org/10.23641/asha.24302221
Article
Full-text available
The use of new technological and learning methods that help to improve the learning process has resulted in the inclusion of video games as active elements in the classroom. Video games are ideal learning tools, since they train skills, promote independence, and increase and improve students' concentration and attention. For special education students with learning difficulties, it is very important to adapt the game to each student's cognitive level and skills. New game technologies have helped to create alternative strategies to increase cognitive skills in the field of Special Education. This chapter describes our experience in video game design and in new forms of human–computer interaction aimed at developing didactic games for children with communication problems such as autism, dysphasia, stroke or some types of cerebral palsy.
Article
Full-text available
This paper reviews some of the recent issues and findings in the area of production and perception of expressive speech and the application to speech synthesis. Specifically, it discusses some of the current problems with data collection, labeling, techniques for analyzing voice quality and applying speech synthesis as an analysis tool. Directions for future work in order to improve synthesis of expressive speech are suggested along the lines of better modeling, labeling and voice quality analysis.
Article
Full-text available
The inclusion of emotional aspects in speech can improve the naturalness of a speech synthesis system. Different emotions (sadness, anger, happiness) are manifested in speech through prosodic elements such as duration, pitch and intensity. The prosodic values corresponding to different emotions are analyzed at the word as well as the phonemic level, using the speech analysis and manipulation tool PRAAT. This paper presents an emotional analysis of the prosodic features duration, pitch and intensity in Malayalam speech. The analysis shows that duration is generally lowest for anger and highest for sadness, whereas intensity is highest for anger and lowest for sadness. A new prosodic feature called rise time/fall time, which can capture both durational and intensity variation, is introduced. The pitch contour, which is flat for neutral speech, shows significant variation across emotions. A detailed analysis of the durations of different phonemes reveals that duration variation is significantly greater for vowels than for consonants.
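The measurements in this paper were made in the PRAAT tool itself; purely as an illustration, the sketch below extracts comparable duration, pitch and intensity summaries through the praat-parselmouth Python bindings, which is an assumption rather than the authors' workflow.

```python
import numpy as np
import parselmouth  # pip install praat-parselmouth

def prosody_summary(path):
    """Return utterance duration, mean F0 and mean intensity."""
    snd = parselmouth.Sound(path)
    pitch = snd.to_pitch()
    intensity = snd.to_intensity()
    f0 = pitch.selected_array["frequency"]
    f0 = f0[f0 > 0]                       # drop unvoiced frames
    return {
        "duration_s": snd.duration,
        "mean_f0_hz": float(f0.mean()) if f0.size else None,
        "mean_intensity_db": float(intensity.values.mean()),
    }
```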
Article
Full-text available
This contribution argues that speech technologies, specifically speaker verification, speech recognition and speech synthesis, need to model the effects of speaker state (attitudes and emotions) in order to increase their quality and acceptance. Experimental research on the effects of stress and emotion on voice quality and prosody is reviewed and linked to basic dimensions of speech communication. It is concluded that the current state of the art in this area can provide speech technologists with important leads for modeling affective speaker states. We argue that there is a strong need for increased collaboration between speech engineers and speech scientists from other disciplines.
Conference Paper
Attempts to add emotion effects to synthesised speech have existed for more than a decade now. Several prototypes and fully operational systems have been built based on different synthesis techniques, and quite a number of smaller studies have been conducted. This paper aims to give an overview of what has been done in this field, pointing out the inherent properties of the various synthesis techniques used, summarising the prosody rules employed, and taking a look at the evaluation paradigms. Finally, an attempt is made to discuss interesting directions for future development.
Article
We review in a common framework several recently proposed algorithms for improving the voice quality of text-to-speech synthesis based on the concatenation of acoustic units (Charpentier and Moulines, 1988; Moulines and Charpentier, 1988; Hamon et al., 1989). These algorithms rely on a pitch-synchronous overlap-add (PSOLA) approach for modifying the speech prosody and concatenating speech waveforms. The modifications of the speech signal are performed either in the frequency domain (FD-PSOLA), using the Fast Fourier Transform, or directly in the time domain (TD-PSOLA), depending on the length of the window used in the synthesis process. The frequency-domain approach allows great flexibility in modifying the spectral characteristics of the speech signal, while the time-domain approach provides very efficient solutions for the real-time implementation of synthesis systems. We also discuss the different kinds of distortion involved in these algorithms.
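To make the time-domain variant concrete, here is a deliberately reduced TD-PSOLA sketch: it assumes pitch marks are already available and performs only voiced pitch modification, without the duration control and unvoiced handling of the full algorithms reviewed in the paper.

```python
import numpy as np

def td_psola(x, marks, f0_ratio):
    """Toy TD-PSOLA: excise two-period Hann-windowed grains around each
    pitch mark and overlap-add them at a spacing scaled by 1/f0_ratio
    (f0_ratio > 1 raises pitch, < 1 lowers it).

    x     : mono signal as a 1-D float array
    marks : ascending sample indices of pitch marks
    """
    out = np.zeros(len(x))
    t_out = float(marks[0])
    for k in range(1, len(marks)):
        p = marks[k] - marks[k - 1]              # local pitch period
        seg = x[marks[k] - p : marks[k] + p]
        if len(seg) < 2 * p:
            break                                 # ran off the signal end
        grain = seg * np.hanning(2 * p)
        start = int(t_out) - p
        if start >= 0 and start + 2 * p <= len(out):
            out[start : start + 2 * p] += grain   # overlap-add
        t_out += p / f0_ratio                     # resynthesis spacing
    return out
```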
Article
All spoken communications imply a particular intent in the transmission of the content of the message, e.g. simple information, question, approval, regret, admiration... The prosodic patterns likely to convey such various intents are frequently studied from an "intonative" standpoint (e.g. melodic contours), but less frequently from a temporal standpoint. The purpose of our study is to better define the time-related regulation of speech when the same statements are expressed by the same speakers but with different intents. The affirmative and interrogative forms of such statements by 12 speakers have been used as reference forms for comparison with expressions of the same statements conveying joy, regret or admiration. The average value of F0, the melodic contours and the statement durations are measured at the global level of the whole sentence, then at a more local level, word by word, using a sound signal editor, and submitted to statistical analyses. The results support the hypothesis of a gradual construction of specific patterns related to each intention as the message unfolds.
Article
There has been considerable research into perceptible correlates of emotional state, but only a limited part of the literature examines the acoustic correlates and other relevant aspects of emotion effects in human speech; in addition, the vocal emotion literature is almost totally separate from the main body of speech analysis literature. A discussion of the literature describing human vocal emotion and its principal findings is presented. The voice parameters affected by emotion are found to be of three main types: voice quality, utterance timing, and utterance pitch contour. These parameters are described both in general and in detail for a range of specific emotions. Current speech synthesizer technology is such that many of the parameters of human speech affected by emotion could be manipulated systematically in synthetic speech to produce a simulation of vocal emotion; the application of this literature to the construction of a system capable of producing synthetic speech with emotion is discussed.