Article · PDF available

Generation of emotions by a morphing technique in English, French and Spanish

Authors:
  • Boula de Mareüil, Célérier and Toen (Interdisciplinary Laboratory of Digital Sciences)

Abstract

Generating variants has become a priority for text-to-speech (TTS) synthesis. In particular, additional mark-ups inserted within the text may be used to communicate emotions. Within the framework of a European project linked to the MPEG-4 standard (INTERFACE), our purpose is the synthesis of six emotions (anger, disgust, fear, joy, surprise and sadness). This was performed by applying a morphing technique to the sequence of phonemes and their corresponding prosodic characteristics, generated for a "neutral" style by a multilingual TTS system. We have at our disposal corpora of these six emotions recorded by professional actors in English, French and Spanish. Some trends may be drawn from them, such as the inversion of fundamental frequency slopes for disgust and the pruning of melodic movements for sadness. We are inclined to think that the perceptual identification of the different emotions will be facilitated, within the framework of MPEG-4, by the addition of a visual component: a talking head.
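To make the morphing idea concrete, here is a minimal Python sketch that interpolates a neutral F0 contour toward emotion-specific targets. The slope-inversion and contour-flattening rules are loose stand-ins inspired by the trends reported above for disgust and sadness; they are not the paper's actual algorithm, and `morph_f0` is a hypothetical helper.

```python
import numpy as np

def morph_f0(neutral_f0, emotion, alpha=1.0):
    """Morph a neutral F0 contour (Hz, one value per frame) toward a
    crude emotion-specific target; alpha = 0 keeps the neutral contour,
    alpha = 1 applies the full transform."""
    f0 = np.asarray(neutral_f0, dtype=float)
    mean = f0.mean()
    if emotion == "disgust":
        # Invert F0 slopes by mirroring the contour around its mean.
        target = 2 * mean - f0
    elif emotion == "sadness":
        # Prune melodic movements by compressing excursions from the mean.
        target = mean + 0.3 * (f0 - mean)
    else:
        target = f0
    # Linear morph between the neutral and target contours.
    return (1 - alpha) * f0 + alpha * target
```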
... Language-specific effects might also be expected (see Section II). For example, whispery voice has been mentioned as being associated with fear in English (Boula de Mareüil et al., 2002), but with different qualities in other languages; breathy voice is traditionally associated with intimacy in English (Laver, 1980) but with formality/politeness in Japanese (Ito, 2004; Ishi et al., 2008). Creaky voice also tends to be associated with different affective states in different languages. ...
Article
Full-text available
The relationship between prosody and perceived affect involves multiple variables. This paper explores the interplay of three: voice quality, f0 contour, and the hearer's language background. Perception tests were conducted with speakers of Irish English, Russian, Spanish, and Japanese using three types of synthetic stimuli: (1) stimuli varied in voice quality, (2) stimuli of uniform (modal) voice quality incorporating affect-related f0 contours, and (3) stimuli combining specific non-modal voice qualities with the affect-related f0 contours of (2). The participants rated the stimuli for the presence/strength of affective colouring on six bipolar scales, e.g., happy-sad. The results suggest that stimuli incorporating non-modal voice qualities, with or without f0 variation, are generally more effective in affect cueing than stimuli varying only in f0. Along with similarities in the affective responses across these languages, many points of divergence were found, both in terms of the range and strength of affective responses overall and in terms of specific stimulus-to-affect associations. The f0 contour may play a more important role, and tense voice a lesser role in affect signalling in Japanese and Spanish than in Irish English and Russian. The greatest cross-language differences emerged for the affects intimate, formal, stressed, and relaxed.
... Fox (2001) notes, in shared-reading situations, emphatic stresses on the syllables or words to which the teacher wants to draw attention. Boiron (2004) observes that reading aloud to pupils is already a first interpretation and allows certain elements to be singled out: she speaks of an "interpretative orientation". ...
... One such technique, concatenative synthesis, automatically recombines large numbers of speech samples so that the resulting sequence matches a target sentence and the resulting sounds match the intended emotion. The emotional content of the concatenated sequence may come from the original speaking style of the pre-recorded samples ("select from the sad corpus") (Eide et al., 2004), result from the algorithmic transformation of neutral samples (Bulut et al., 2005), or come from hybrid approaches that morph between different emotional samples (Boula de Mareüil et al., 2002). Another transformation approach to emotional speech synthesis is the recent trend of "voice conversion" research, which tries to impersonate a target voice by modifying a source voice. ...
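The "select from the sad corpus" strategy mentioned in this excerpt can be caricatured in a few lines. The sketch below is a toy greedy unit selection over a hypothetical list of (phone, emotion, waveform) tuples; real systems add join and target costs that this deliberately omits.

```python
def select_units(target_phones, emotion, corpus):
    """Toy unit selection: for each target phone, pick the corpus unit
    with a matching phone label, preferring units recorded in the
    intended emotion. `corpus` is a list of (phone, emotion, waveform)
    tuples standing in for a real unit-selection database."""
    sequence = []
    for phone in target_phones:
        candidates = [u for u in corpus if u[0] == phone]
        if not candidates:
            raise ValueError(f"no unit found for phone {phone!r}")
        # Zero cost for units in the intended emotion, else a penalty.
        best = min(candidates, key=lambda u: 0 if u[1] == emotion else 1)
        sequence.append(best[2])  # keep only the waveform
    return sequence
```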
Article
Full-text available
We present an open-source software platform that transforms emotional cues expressed by speech signals using audio effects like pitch shifting, inflection, vibrato, and filtering. The emotional transformations can be applied to any audio file, but can also run in real time, using live input from a microphone, with less than 20-ms latency. We anticipate that this tool will be useful for the study of emotions in psychology and neuroscience, because it enables a high level of control over the acoustical and emotional content of experimental stimuli in a variety of laboratory situations, including real-time social situations. We present here results of a series of validation experiments aiming to position the tool against several methodological requirements: that transformed emotions be recognized at above-chance levels, valid in several languages (French, English, Swedish, and Japanese) and with a naturalness comparable to natural speech.
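As a hint of how one of the listed effects might be realized, the following sketch implements vibrato as a sinusoidally modulated fractional delay line, a textbook formulation with assumed rate and depth parameters; the platform described in the abstract may implement the effect differently.

```python
import numpy as np

def vibrato(signal, sr, rate_hz=6.0, depth_ms=0.5):
    """Apply vibrato to a 1-D float array by reading the signal through
    a delay line whose length oscillates sinusoidally."""
    depth = depth_ms * 1e-3 * sr                    # depth in samples
    n = np.arange(len(signal))
    delay = depth * (1 + np.sin(2 * np.pi * rate_hz * n / sr))
    read_pos = np.clip(n - delay, 0, len(signal) - 1)
    i = read_pos.astype(int)                        # linear interpolation
    frac = read_pos - i
    j = np.minimum(i + 1, len(signal) - 1)
    return (1 - frac) * signal[i] + frac * signal[j]
```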
... Cahn (1990) has experimented with an "affect editor", which uses an abstract model of emotional speech along with generation instructions to produce recognizable and sometimes even natural-sounding emotions with formant synthesis. Boula de Mareüil et al. (2002) have synthesized six emotions (anger, disgust, fear, joy, surprise and sadness) in three languages (English, French and Spanish) using corpora and morphing techniques. Organic variation is difficult to manipulate, especially in concatenation synthesis. ...
Article
Full-text available
Phonetic variation, and especially prosodic variation, which is often paralinguistic in nature, has gradually attracted more attention among speech researchers and speech scientists as one possible solution to problems with automatic speaker recognition (ASrR) and text-to-speech synthesis (TTS) systems. This paper presents a brief overview of approaches to phonetic variation in ASrR and TTS, beginning with attempts to classify linguistic and paralinguistic phenomena in speech. Some of the problems related to paralinguistic phonetic variation and attempted solutions are also discussed.
... At the same time, the use of these means, their relative importance and their meaning vary across languages (Chen, 2005; Abelin & Allwood, 2000). Studies of cross-linguistic production are still quite rare (Boula de Mareuil, Célérier, & Toen, 2002). They usually involve a very small number of speakers per language and disregard a widely observed phenomenon in affective speech: high inter-speaker and intra-speaker variability. ...
Article
Full-text available
The main objective of this research is to investigate the production of affective speech by bilingual and monolingual children cross-linguistically. Cross-linguistic differences in affective speech may lead bilingual children to perceive and to express emotions differently in their two languages. A cross-linguistically comparable corpus of 8 bilingual Scottish-French children and 16 monolingual peers (average age 8) was recorded according to the developed methodology. This chapter presents preliminary results on pitch range, peak alignment and speech rate for bilingual children and their monolingual peers, comparing their emotions and languages.
Chapter
Depression has been affecting people all around the world, including Malaysians. Early detection mechanisms are vital for assisting clinical professionals in identifying depressed patients at an early stage. Although this can be accomplished through interviews and questionnaires, these time-consuming methods have several additional disadvantages. Acoustic measurement and MFCC have notably been adapted to detect speaker emotion, and numerous researchers have applied them to various languages for prediction. Their efficiency varies across studies, although they contribute significantly to diagnosing depression. As cultural diversity appears to influence how emotion is perceived, depression detection mechanisms can vary between languages. This paper provides a comprehensive analysis based on relevant studies published from 2000 to 2023 to show the effectiveness of acoustic measurement and MFCC in depression detection. It was discovered that the Support Vector Machine (SVM) is extensively utilised and can successfully contribute to the detection of depressed patients using biometric characteristics. The outcome of this study encourages experimental investigation of the effectiveness of acoustic measurement and MFCC for depression identification among Malaysian speakers.
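A minimal sketch of the MFCC-plus-SVM pipeline this review examines, assuming librosa and scikit-learn; `wav_paths` and `labels` are placeholders for a clinically annotated corpus that the sketch does not supply.

```python
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def mfcc_features(path, n_mfcc=13):
    """Summarize an utterance by the mean and std of its MFCCs."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# wav_paths / labels (1 = depressed, 0 = control) must come from an
# annotated corpus; they are placeholders here.
X = np.array([mfcc_features(p) for p in wav_paths])
y = np.array(labels)
print(cross_val_score(SVC(kernel="rbf", C=1.0), X, y, cv=5).mean())
```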
Article
Purpose: Capturing phonation types such as breathy, modal, and pressed voices precisely can facilitate the recognition of human emotions. However, little is known about how exactly phonation types and decoders' gender influence the perception of emotional speech. Based on the modified Brunswikian lens model, this article aims to examine the roles of phonation types and decoders' gender in Mandarin emotional speech recognition by virtue of articulatory speech synthesis.

Method: Fifty-five participants (28 male and 27 female) completed a recognition task of Mandarin emotional speech, with 200 stimuli representing five emotional categories (happiness, anger, fear, sadness, and neutrality) and five types (original, copied, breathy, modal, and pressed). Repeated-measures analyses of variance were performed to analyze recognition accuracy and confusion data.

Results: For male and female decoders, the recognition accuracy of anger from pressed stimuli and fear from breathy stimuli was high; across all phonation-type stimuli, the recognition accuracy of sadness was also high, but that of happiness was low. The confusion data revealed that in recognizing fear from all phonation-type stimuli, female decoders chose fear responses more frequently and neutral responses less frequently than male decoders. In recognizing neutrality from breathy stimuli, female decoders significantly reduced their choice of neutral responses and misidentified neutrality as anger, while male decoders mistook neutrality from pressed stimuli for anger.

Conclusions: This study revealed that, in Mandarin, phonation types play crucial roles in recognizing anger, fear, and neutrality, while the recognition of sadness and happiness seems not to depend heavily on phonation types. Moreover, the decoders' gender affects their recognition of neutrality and fear. These findings support the modified Brunswikian lens model and have significance for diagnosis and intervention among clinical populations with hearing impairment or gender-related psychiatric disorders.

Supplemental Material: https://doi.org/10.23641/asha.24302221
Article
Full-text available
The use of new technological and learning methods that help to improve the learning process has resulted in the inclusion of video games as active elements in the classroom. Video games are ideal learning tools, since they train skills, promote independence, and increase and improve students' concentration and attention. For special education students with learning difficulties, it is very important to adapt the game to each student's cognitive level and skills. New game technologies have helped to create alternative strategies to increase cognitive skills in the field of Special Education. This chapter describes our experience in video game design and in new forms of human–computer interaction aimed at developing didactic games for children with communication problems such as autism, dysphasia, stroke or some types of cerebral palsy.
Article
Full-text available
This paper reviews some of the recent issues and findings in the area of production and perception of expressive speech and the application to speech synthesis. Specifically, it discusses some of the current problems with data collection, labeling, techniques for analyzing voice quality and applying speech synthesis as an analysis tool. Directions for future work in order to improve synthesis of expressive speech are suggested along the lines of better modeling, labeling and voice quality analysis.
Article
Full-text available
The inclusion of emotional aspects in speech can improve the naturalness of a speech synthesis system. Different emotions (sadness, anger, happiness) are manifested in speech through prosodic elements such as duration, pitch and intensity. The prosodic values corresponding to different emotions are analyzed at the word as well as the phonemic level, using the speech analysis and manipulation tool PRAAT. This paper presents an emotional analysis of the prosodic features duration, pitch and intensity in Malayalam speech. The analysis shows that duration is generally lowest for anger and highest for sadness, whereas intensity is highest for anger and lowest for sadness. A new prosodic feature called rise time/fall time, which can capture both durational and intensity variation, is introduced. The pitch contour, which is flat for neutral speech, shows significant variation across emotions. A detailed analysis of the durations of different phonemes reveals that duration variation is significantly greater for vowels than for consonants.
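The measurements in this paper were made in the PRAAT tool itself; purely as an illustration, the sketch below extracts comparable duration, pitch and intensity summaries through the praat-parselmouth Python bindings, which is an assumption rather than the authors' workflow.

```python
import numpy as np
import parselmouth  # pip install praat-parselmouth

def prosody_summary(path):
    """Return utterance duration, mean F0 and mean intensity."""
    snd = parselmouth.Sound(path)
    pitch = snd.to_pitch()
    intensity = snd.to_intensity()
    f0 = pitch.selected_array["frequency"]
    f0 = f0[f0 > 0]                       # drop unvoiced frames
    return {
        "duration_s": snd.duration,
        "mean_f0_hz": float(f0.mean()) if f0.size else None,
        "mean_intensity_db": float(intensity.values.mean()),
    }
```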
Article
Full-text available
This contribution argues that speech technologies, specifically speaker verification, speech recognition and speech synthesis, need to model the effects of speaker state (attitudes and emotions) in order to increase their quality and acceptance. Experimental research on the effects of stress and emotion on voice quality and prosody is reviewed and linked to basic dimensions of speech communication. It is concluded that the current state of the art in this area can provide speech technologists with important leads for modeling affective speaker states. We argue that there is a strong need for increased collaboration between speech engineers and speech scientists from other disciplines.
Conference Paper
Attempts to add emotion effects to synthesised speech have existed for more than a decade now. Several prototypes and fully operational systems have been built based on different synthesis techniques, and quite a number of smaller studies have been conducted. This paper aims to give an overview of what has been done in this field, pointing out the inherent properties of the various synthesis techniques used, summarising the prosody rules employed, and taking a look at the evaluation paradigms. Finally, an attempt is made to discuss interesting directions for future development.
Article
We review in a common framework several recently proposed algorithms for improving the voice quality of text-to-speech synthesis based on the concatenation of acoustic units (Charpentier and Moulines, 1988; Moulines and Charpentier, 1988; Hamon et al., 1989). These algorithms rely on a pitch-synchronous overlap-add (PSOLA) approach for modifying the speech prosody and concatenating speech waveforms. The modifications of the speech signal are performed either in the frequency domain (FD-PSOLA), using the Fast Fourier Transform, or directly in the time domain (TD-PSOLA), depending on the length of the window used in the synthesis process. The frequency-domain approach allows great flexibility in modifying the spectral characteristics of the speech signal, while the time-domain approach provides very efficient solutions for the real-time implementation of synthesis systems. We also discuss the different kinds of distortion involved in these algorithms.
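To make the time-domain variant concrete, here is a deliberately reduced TD-PSOLA sketch: it assumes pitch marks are already available and performs only voiced pitch modification, without the duration control and unvoiced handling of the full algorithms reviewed in the paper.

```python
import numpy as np

def td_psola(x, marks, f0_ratio):
    """Toy TD-PSOLA: excise two-period Hann-windowed grains around each
    pitch mark and overlap-add them at a spacing scaled by 1/f0_ratio
    (f0_ratio > 1 raises pitch, < 1 lowers it).

    x     : mono signal as a 1-D float array
    marks : ascending sample indices of pitch marks
    """
    out = np.zeros(len(x))
    t_out = float(marks[0])
    for k in range(1, len(marks)):
        p = marks[k] - marks[k - 1]              # local pitch period
        seg = x[marks[k] - p : marks[k] + p]
        if len(seg) < 2 * p:
            break                                 # ran off the signal end
        grain = seg * np.hanning(2 * p)
        start = int(t_out) - p
        if start >= 0 and start + 2 * p <= len(out):
            out[start : start + 2 * p] += grain   # overlap-add
        t_out += p / f0_ratio                     # resynthesis spacing
    return out
```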
Article
All spoken communications imply a particular intent in the transmission of the content of the message, e.g. simple information, question, approval, regret, admiration... The prosodic patterns likely to convey such various intents are frequently studied from an "intonative" standpoint (e.g. melodic contours), but less frequently from a temporal standpoint. The purpose of our study is to better define the time-related regulation of speech when the same statements are expressed by the same speakers but with different intents. The affirmative and interrogative forms of such statements by 12 speakers have been used as reference forms for comparison with expressions of the same statements conveying joy, regret or admiration. The average value of F0, the melodic contours and the statement durations are measured at the global level of the whole sentence, then at a more local level, word by word, using a sound signal editor, and submitted to statistical analyses. The results support the hypothesis of a gradual construction of specific patterns related to each intention as the message unfolds.
Article
There has been considerable research into perceptible correlates of emotional state, but only a limited part of the literature examines the acoustic correlates and other relevant aspects of emotion effects in human speech; in addition, the vocal emotion literature is almost totally separate from the main body of speech analysis literature. A discussion of the literature describing human vocal emotion and its principal findings is presented. The voice parameters affected by emotion are found to be of three main types: voice quality, utterance timing, and utterance pitch contour. These parameters are described both in general and in detail for a range of specific emotions. Current speech synthesizer technology is such that many of the parameters of human speech affected by emotion could be manipulated systematically in synthetic speech to produce a simulation of vocal emotion; the application of this literature to the construction of a system capable of producing synthetic speech with emotion is discussed.