Figure - uploaded by Yuki Yamashita
XAB test comparing the additional context with the baseline context using LSTM.

Contexts in source publication

Context 1
... tone label, word, clause, speaking style, and disfluency, the additional context gave significantly higher scores than the baseline context using the DNN. Table 2 shows the results of the experiments comparing the baseline and additional contexts using the LSTM. For tone label, word, clause, phone prolongation, speaking style, and disfluency, the additional context gave significantly higher scores than the baseline context using the LSTM. ...
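The comparison above rests on feeding richer context labels into the acoustic model alongside the baseline linguistic features. As an illustration only (the source publication does not provide code), the following Python/PyTorch sketch shows one way the additional context features (tone label, word, clause, phone prolongation, speaking style, disfluency) could be concatenated with a baseline linguistic feature vector at the input of an LSTM acoustic model; all dimensions, layer choices, and names below are assumptions for the sketch, not details taken from the paper.

# Minimal sketch (not the authors' code): appending additional context features
# to a baseline linguistic feature vector before an LSTM acoustic model.
# All dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

class LSTMAcousticModel(nn.Module):
    def __init__(self, baseline_dim=300, additional_dim=20, acoustic_dim=187, hidden=256):
        super().__init__()
        # Input is the baseline context vector concatenated with the additional contexts.
        self.lstm = nn.LSTM(baseline_dim + additional_dim, hidden,
                            num_layers=2, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, acoustic_dim)  # acoustic features per frame

    def forward(self, baseline_feats, additional_feats):
        # baseline_feats:   (batch, frames, baseline_dim)
        # additional_feats: (batch, frames, additional_dim); zeroed for a baseline system
        x = torch.cat([baseline_feats, additional_feats], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)

# Baseline system: additional contexts zeroed out; proposed system: actual labels supplied.
model = LSTMAcousticModel()
acoustic = model(torch.randn(1, 100, 300), torch.randn(1, 100, 20))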

Citations

... The effect of robotic motion on conscientiousness indicates that state-machine-like gesture methods may be perceived as more considered and self-disciplined. The consistent dampening effect of the TTS (V R) may motivate the continuous updating of a virtual agent system to use the best available speech synthesizer, an area of active research [58,66]. Our results also indicate that an actor's personality plays a role, with lowly extraverted actors being rated as more extraverted and highly extraverted actors as less extraverted when represented by processed or no motion. ...
Conference Paper
The portrayed personality of virtual characters and agents is understood to influence how we perceive and engage with digital applications. Understanding how the features of speech and animation drive portrayed personality allows us to intentionally design characters to be more personalized and engaging. In this study, we use performance capture data of unscripted conversations from a variety of actors to explore the perceptual outcomes associated with the modalities of speech and motion. Specifically, we contrast full performance-driven characters with those portrayed by generated gestures and synthesized speech, analysing how the features of each influence portrayed personality according to the Big Five personality traits. We find that processing speech and motion can have mixed effects on such traits, with our results highlighting motion as the dominant modality for portraying extraversion and speech as dominant for communicating agreeableness and emotional stability. Our results can support the Extended Reality (XR) community in the development of virtual characters, social agents and 3D User Interface (3DUI) agents portraying a range of targeted personalities.
... As for spontaneous behaviors, some studies have focused particularly on the insertion and synthesis of filler pauses (especially "uh" and "um") [14,15,16]. As for the modeling of conversation history, researchers have found that using contextual statistical features can noticeably improve naturalness in conversational speech synthesis [17,18]. Besides, syntactic structure and chat history have also proved useful in recent seq2seq-based conversational speech synthesis [19]. ...
Preprint
In spoken conversations, spontaneous behaviors such as filled pauses and prolongations occur frequently. A conversational partner also tends to align features of their speech with those of their interlocutor, a phenomenon known as entrainment. To produce human-like conversations, we propose a unified, controllable spontaneous conversational speech synthesis framework that models both phenomena. Specifically, we use explicit labels to represent two typical spontaneous behaviors, filled pause and prolongation, in the acoustic model, and we develop a neural-network-based predictor to predict the occurrences of the two behaviors from text. We then develop an algorithm based on the predictor to control the occurrence frequency of the behaviors, making the synthesized speech range from less to more disfluent. To model speech entrainment at the acoustic level, we use a context acoustic encoder to extract a global style embedding from the previous utterance, which conditions the synthesis of the current utterance. Furthermore, since the current and previous utterances belong to different speakers in a conversation, we add a domain adversarial training module that removes speaker-related information from the acoustic encoder while retaining style-related information. Experiments show that the proposed approach can synthesize realistic conversations and naturally control the occurrences of the spontaneous behaviors.
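The abstract above combines a context acoustic encoder with domain adversarial training. The Python/PyTorch sketch below illustrates how such an encoder could pool the previous utterance into a global style embedding and pass it through a gradient-reversal speaker classifier, so that speaker identity is suppressed while style information is kept. It is a minimal illustration under assumed dimensions and layer choices, not the preprint's implementation.

# Minimal sketch (assumptions, not the preprint's code): a context acoustic encoder
# producing a global style embedding from the previous utterance, trained adversarially
# against a speaker classifier via gradient reversal. Dimensions are illustrative.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        # Flip the gradient so the encoder learns to remove speaker information.
        return -ctx.lam * grad, None

class ContextAcousticEncoder(nn.Module):
    def __init__(self, n_mels=80, style_dim=128, n_speakers=10):
        super().__init__()
        self.rnn = nn.GRU(n_mels, style_dim, batch_first=True)
        self.speaker_clf = nn.Linear(style_dim, n_speakers)

    def forward(self, prev_mels, lam=1.0):
        # prev_mels: (batch, frames, n_mels) from the previous utterance in the dialogue
        _, h = self.rnn(prev_mels)
        style = h[-1]                                   # global style embedding
        spk_logits = self.speaker_clf(GradReverse.apply(style, lam))
        # style conditions the TTS decoder; spk_logits feed an adversarial CE loss
        return style, spk_logits

encoder = ContextAcousticEncoder()
style, spk_logits = encoder(torch.randn(2, 300, 80))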