Figure 3 - uploaded by Shubham Bansal
MOS comparison with different models on the English, Code-mixed, and Regional test sets.


Contexts in source publication

Context 1
... English synthesizer 3) MiFMiS SH: Mixlingual front-end and mixlingual synthesizer (trained with "shared" phoneset) 4) MiFMiS SE: Mixlingual front-end and mixlingual synthesizer (trained with "separated" phoneset). We also compare them with the high-quality recordings of our bilingual voice talent. MOS results are shown in Table 4 and compared in Fig. ...

Citations

... A TTS system is generally a pipeline of three components, as shown in Fig. 1: front-end, acoustic model, and vocoder. The front-end plays a crucial role in a TTS system to provide the required phoneme-relevant linguistic knowledge [37]-[39]. We propose an accented TTS system that consists of an accented front-end and an accented acoustic model. ...
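For readers unfamiliar with this decomposition, a minimal sketch of such a three-stage pipeline is given below. All class and function names are illustrative placeholders, not APIs from the cited work.

```python
# Minimal sketch of the classic three-stage TTS pipeline
# (front-end -> acoustic model -> vocoder). Names are illustrative only.
import numpy as np


class Frontend:
    """Text analysis: normalization and grapheme-to-phoneme conversion."""
    def text_to_phonemes(self, text: str) -> list[str]:
        # A real front-end would run text normalization + G2P; here we fake it.
        return list(text.lower().replace(" ", "|"))


class AcousticModel:
    """Maps a phoneme sequence to a mel-spectrogram (frames x mel bins)."""
    def phonemes_to_mel(self, phonemes: list[str]) -> np.ndarray:
        frames_per_phoneme = 5
        return np.zeros((len(phonemes) * frames_per_phoneme, 80))


class Vocoder:
    """Converts a mel-spectrogram into a time-domain waveform."""
    def mel_to_waveform(self, mel: np.ndarray, hop_length: int = 256) -> np.ndarray:
        return np.zeros(mel.shape[0] * hop_length)


def synthesize(text: str) -> np.ndarray:
    frontend, acoustic_model, vocoder = Frontend(), AcousticModel(), Vocoder()
    phonemes = frontend.text_to_phonemes(text)
    mel = acoustic_model.phonemes_to_mel(phonemes)
    return vocoder.mel_to_waveform(mel)


if __name__ == "__main__":
    audio = synthesize("hello world")
    print(audio.shape)  # (14080,) for this dummy pipeline
```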
Article
Full-text available
This paper presents an accented text-to-speech (TTS) synthesis framework with limited training data. We study two aspects concerning accent rendering: phonetic (phoneme difference) and prosodic (pitch pattern and phoneme duration) variations. The proposed accented TTS framework consists of two models: an accented front-end for grapheme-to-phoneme (G2P) conversion and an accented acoustic model with integrated pitch and duration predictors for phoneme-to-Mel-spectrogram prediction. The accented front-end directly models the phonetic variation, while the accented acoustic model explicitly controls the prosodic variation. Specifically, both models are first pre-trained on a large amount of data, then only the accent-related layers are fine-tuned on a limited amount of data for the target accent. In the experiments, speech data of three English accents, i.e., General American English, Irish English, and British English Received Pronunciation, are used for pre-training. The pre-trained models are then fine-tuned with Scottish and General Australian English accents, respectively. Both objective and subjective evaluation results show that the accented TTS front-end fine-tuned with a small accented phonetic lexicon (5k words) effectively handles the phonetic variation of accents, while the accented TTS acoustic model fine-tuned with a limited amount of accented speech data (approximately 3 minutes) effectively improves the prosodic rendering including pitch and duration. The overall accent modeling contributes to improved speech quality and accent similarity.
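The pre-train-then-fine-tune recipe described in the abstract, where only accent-related layers are updated on the small target-accent dataset, can be sketched as follows. The architecture and module names ("accent_*") are hypothetical stand-ins, not the paper's actual model.

```python
# Sketch of fine-tuning only accent-related layers while freezing the rest.
# Module names ("accent_*") and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class AccentedAcousticModel(nn.Module):
    def __init__(self, n_phonemes: int = 100, d: int = 256, n_mels: int = 80):
        super().__init__()
        self.phoneme_embedding = nn.Embedding(n_phonemes, d)
        self.encoder = nn.GRU(d, d, batch_first=True)
        # Accent-related layers: pitch and duration predictors.
        self.accent_pitch = nn.Linear(d, 1)
        self.accent_duration = nn.Linear(d, 1)
        self.decoder = nn.Linear(d, n_mels)

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        h, _ = self.encoder(self.phoneme_embedding(phoneme_ids))
        return self.decoder(h)  # pitch/duration heads omitted for brevity


model = AccentedAcousticModel()
# model.load_state_dict(torch.load("pretrained.pt"))  # hypothetical checkpoint

# Freeze everything, then unfreeze only the accent-related layers.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("accent_")

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```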
... Further, different Grapheme-to-Phoneme (G2P) models for English words and regional words were utilized in [31,39]. A single mix-lingual G2P model, instead of two separate models, was proposed in [3]. In [45], embeddings from an external cross-lingual language model were integrated into the front-end of the Tacotron2 model along with the original phone embeddings. ...
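One way to read the last approach is that token-level embeddings from a cross-lingual language model are concatenated with the phone embeddings before the TTS encoder. A rough sketch under that assumption follows; dimensions, names, and the alignment step are illustrative, not taken from [45].

```python
# Rough sketch of fusing external cross-lingual LM embeddings with phone
# embeddings at the TTS front-end. Dimensions and names are assumptions.
import torch
import torch.nn as nn


class FusedInputEmbedding(nn.Module):
    def __init__(self, n_phones: int = 100, d_phone: int = 256, d_lm: int = 768):
        super().__init__()
        self.phone_embedding = nn.Embedding(n_phones, d_phone)
        self.proj = nn.Linear(d_phone + d_lm, d_phone)

    def forward(self, phone_ids: torch.Tensor, lm_embeddings: torch.Tensor) -> torch.Tensor:
        # phone_ids: (batch, T_phone); lm_embeddings: (batch, T_phone, d_lm),
        # assumed already upsampled from word level to phone level.
        phone_emb = self.phone_embedding(phone_ids)
        fused = torch.cat([phone_emb, lm_embeddings], dim=-1)
        return self.proj(fused)


fuse = FusedInputEmbedding()
phone_ids = torch.randint(0, 100, (2, 12))
lm_embeddings = torch.randn(2, 12, 768)
encoder_inputs = fuse(phone_ids, lm_embeddings)
print(encoder_inputs.shape)  # torch.Size([2, 12, 256])
```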
Chapter
Full-text available
Text-to-speech (TTS) systems are an important component in voice-based e-commerce applications. These applications include end-to-end voice assistant and customer experience (CX) voice bot. Code-mixed TTS is also relevant in these applications since the product names are commonly described in English while the surrounding text is in a regional language. In this work, we describe our approaches for production quality code-mixed Hindi-English TTS systems built for e-commerce applications. We propose a data-oriented approach by utilizing monolingual data sets in individual languages. We leverage a transliteration model to convert the Roman text into a common Devanagari script and then combine both datasets for training. We show that such single script bi-lingual training without any code-mixing works well for pure code-mixed test sets. We further present an exhaustive evaluation of single-speaker adaptation and multi-speaker training with Tacotron2 + Waveglow setup to show that the former approach works better. These approaches are also coupled with transfer learning and decoder-only fine-tuning to improve performance. We compare these approaches with the Google TTS and report a positive CMOS score of 0.02 with the proposed transfer learning approach. We also perform low-resource voice adaptation experiments to show that a new voice can be onboarded with just 3 hrs of data. This highlights the importance of our pre-trained models in resource-constrained settings. This subjective evaluation is performed on a large number of out-of-domain pure code-mixed sentences to demonstrate the high quality of the systems.
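The data-oriented idea in this abstract, transliterating Roman (English) text into Devanagari so both corpora share one script before training, can be pictured with the small sketch below. The transliteration function and the word lookup are placeholders for the actual learned transliteration model; they are not from the cited chapter.

```python
# Sketch of single-script data preparation: English (Roman-script) lines are
# transliterated into Devanagari and merged with the Hindi corpus.
# `transliterate_to_devanagari` is a stand-in for a real transliteration model.
def transliterate_to_devanagari(roman_text: str) -> str:
    # Placeholder lookup; a real system would use a learned transliterator.
    lookup = {"order": "ऑर्डर", "mobile": "मोबाइल", "cover": "कवर"}
    return " ".join(lookup.get(w.lower(), w) for w in roman_text.split())


def build_single_script_corpus(hindi_lines, english_lines):
    corpus = list(hindi_lines)  # already in Devanagari
    corpus += [transliterate_to_devanagari(line) for line in english_lines]
    return corpus


corpus = build_single_script_corpus(
    ["आपका ऑर्डर कल आएगा"],
    ["mobile cover"],
)
print(corpus)  # ['आपका ऑर्डर कल आएगा', 'मोबाइल कवर']
```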
... Based on the disentanglement strategy, the existing cross-lingual approaches can be roughly divided into implicit-based and explicit-based methods [9]. Implicit-based methods mainly study the unified linguistic/phonetic representations across languages to disentangle language and speaker timbre implicitly [11], [12], [13], [14], [15], [16]. On the other hand, to further solve the foreign accent problem, the explicit-based methods prefer to adopt adversarial learning [1], [7], [9], [17] or mutual information [6] to minimize the correlation between different speech factors, thus encouraging the model to automatically learn disentangled linguistic representations. ...
... Most current studies realize cross-lingual TTS by mixing monolingual corpora of different languages while disentangling the speaker and language or linguistic representations in implicit or explicit ways to alleviate the foreign accent problem. Implicit methods mainly focus on exploring language-irrelevant input representations [11], [12], [14], [15]. Liu et al. [42] introduce a shared phoneme set for different languages. ...
... In [11], [12], the Automatic Speech Recognition (ASR) models are employed to extract language-irrelevant Phonetic Posterior Gram (PPG) features as the input representations. Unicode bytes [13], mixed-lingual Grapheme-to-Phoneme (G2P) [14] frontend, and International Phonetic Alphabet (IPA) [43], [15], [16] are also taken as the unified phonetic representations that share pronunciation across languages [9]. These studies indicate that language-irrelevant representations can help disentangle speaker and language, but the complexity of the cross-lingual TTS pipeline is increased. ...
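The explicit, adversarial flavor of disentanglement mentioned in these excerpts is often built on a gradient-reversal layer: a speaker classifier is trained on the linguistic representation, while the reversed gradient pushes the text encoder to discard speaker cues. The sketch below is a generic illustration of that idea under assumed sizes, not the exact method of any of the cited papers.

```python
# Generic sketch of explicit, adversarial speaker/linguistic disentanglement
# using a gradient-reversal layer. Sizes and names are assumptions.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


class SpeakerAdversary(nn.Module):
    def __init__(self, d: int = 256, n_speakers: int = 10, lam: float = 1.0):
        super().__init__()
        self.lam = lam
        self.classifier = nn.Linear(d, n_speakers)

    def forward(self, linguistic_repr: torch.Tensor) -> torch.Tensor:
        # linguistic_repr: (batch, T, d) from the text encoder.
        reversed_repr = GradReverse.apply(linguistic_repr.mean(dim=1), self.lam)
        return self.classifier(reversed_repr)  # speaker logits


adversary = SpeakerAdversary()
linguistic_repr = torch.randn(4, 20, 256, requires_grad=True)
speaker_logits = adversary(linguistic_repr)
loss = nn.functional.cross_entropy(speaker_logits, torch.tensor([0, 1, 2, 3]))
loss.backward()  # gradients w.r.t. the encoder output are sign-flipped
```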
Preprint
While the performance of cross-lingual TTS based on monolingual corpora has been significantly improved recently, generating cross-lingual speech still suffers from the foreign accent problem, leading to limited naturalness. Besides, current cross-lingual methods ignore modeling emotion, which is indispensable paralinguistic information in speech delivery. In this paper, we propose DiCLET-TTS, a Diffusion model based Cross-Lingual Emotion Transfer method that can transfer emotion from a source speaker to the intra- and cross-lingual target speakers. Specifically, to relieve the foreign accent problem while improving the emotion expressiveness, the terminal distribution of the forward diffusion process is parameterized into a speaker-irrelevant but emotion-related linguistic prior by a prior text encoder with the emotion embedding as a condition. To address the weaker emotional expressiveness problem caused by speaker disentanglement in emotion embedding, a novel orthogonal projection based emotion disentangling module (OP-EDM) is proposed to learn the speaker-irrelevant but emotion-discriminative embedding. Moreover, a condition-enhanced DPM decoder is introduced to strengthen the modeling ability of the speaker and the emotion in the reverse diffusion process to further improve emotion expressiveness in speech delivery. Cross-lingual emotion transfer experiments show the superiority of DiCLET-TTS over various competitive models and the good design of OP-EDM in learning speaker-irrelevant but emotion-discriminative embedding.
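The orthogonal-projection idea behind OP-EDM can be illustrated in a few lines: the component of the emotion embedding lying along the speaker embedding is projected out, leaving a direction orthogonal to (and thus less informative about) the speaker. This sketch shows only that projection step, with assumed shapes, not the full OP-EDM module.

```python
# Illustration of removing the speaker direction from an emotion embedding
# via orthogonal projection: e_perp = e - (<e, s> / <s, s>) * s.
import torch


def project_out_speaker(emotion_emb: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
    # emotion_emb, speaker_emb: (batch, d)
    scale = (emotion_emb * speaker_emb).sum(dim=-1, keepdim=True)
    scale = scale / speaker_emb.pow(2).sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return emotion_emb - scale * speaker_emb


emotion_emb = torch.randn(2, 128)
speaker_emb = torch.randn(2, 128)
e_perp = project_out_speaker(emotion_emb, speaker_emb)
# The result is (numerically) orthogonal to the speaker embedding.
print((e_perp * speaker_emb).sum(dim=-1))  # ~0
```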
... The evaluation findings demonstrate substantial gains in ASR accuracy and resilience for various low-resource languages. [14] proposed a novel grapheme-to-phoneme (G2P) model for code-mixed speech synthesis, termed Mixlingual, which employs a sequence-to-sequence architecture with an attention mechanism to dynamically transition between several language models based on the input. Assessment on Hindi-English code-mixed speech synthesis tasks demonstrates that the proposed Mixlingual technique outperforms previous state-of-the-art models in the naturalness and intelligibility of the synthesized speech. ...
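As a point of reference, an attention-based sequence-to-sequence G2P model of this general kind can be sketched as below. This is an illustrative architecture with assumed vocabulary sizes and dimensions, not the exact Mixlingual model from [14].

```python
# Generic sketch of an attention-based seq2seq G2P model (graphemes -> phonemes).
# Architecture, sizes, and names are illustrative assumptions.
import torch
import torch.nn as nn


class Seq2SeqG2P(nn.Module):
    def __init__(self, n_graphemes=80, n_phonemes=60, d=128):
        super().__init__()
        self.g_emb = nn.Embedding(n_graphemes, d)
        self.p_emb = nn.Embedding(n_phonemes, d)
        self.encoder = nn.GRU(d, d, batch_first=True, bidirectional=True)
        self.decoder = nn.GRUCell(d + 2 * d, d)
        self.attn = nn.Linear(2 * d + d, 1)
        self.out = nn.Linear(d, n_phonemes)

    def forward(self, graphemes, phonemes):
        # graphemes: (B, T_g) ids; phonemes: (B, T_p) ids (teacher forcing).
        enc, _ = self.encoder(self.g_emb(graphemes))          # (B, T_g, 2d)
        h = enc.new_zeros(graphemes.size(0), self.out.in_features)
        logits = []
        for t in range(phonemes.size(1)):
            # Additive-style attention over encoder states.
            query = h.unsqueeze(1).expand(-1, enc.size(1), -1)
            scores = self.attn(torch.cat([enc, query], dim=-1)).squeeze(-1)
            weights = torch.softmax(scores, dim=-1).unsqueeze(-1)
            context = (weights * enc).sum(dim=1)               # (B, 2d)
            h = self.decoder(torch.cat([self.p_emb(phonemes[:, t]), context], -1), h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                      # (B, T_p, n_phonemes)


model = Seq2SeqG2P()
graphemes = torch.randint(0, 80, (2, 10))
phonemes = torch.randint(0, 60, (2, 8))
print(model(graphemes, phonemes).shape)  # torch.Size([2, 8, 60])
```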
Conference Paper
Full-text available
Developing reliable Automatic Speech Recognition (ASR) system for Indian Languages has been challenging due to the limited availability of large-scale, high-quality speech datasets. This problem is even more pronounced when dealing with noisy code-mixed settings with different grapheme vocabularies. This paper proposes a novel ASR system for low-resource noisy speech code mixed with Indian languages. Our approach involves fine-tuning pre-trained models using text transliterated to Devanagari and mapping similar-sounding characters into one character group. Experiments show the model’s effectiveness for low-resource Indian languages, including noisy, code-mixed, and multilingual settings. The approach outperforms several baseline models and demonstrates the potential for adapting state-of-the-art ASR models to new languages with limited resources. The proposed system has been deployed in production, where call centers use it to transcribe customer calls.
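The character-grouping step in this abstract, collapsing similar-sounding Devanagari characters into one representative per group before fine-tuning, can be pictured as a simple text normalization pass. The specific groups below are illustrative examples only, not the paper's exact mapping.

```python
# Sketch of collapsing similar-sounding Devanagari characters into a single
# representative per group before ASR fine-tuning. Groups are examples only.
CHAR_GROUPS = {
    "श": "स", "ष": "स",   # sibilants collapsed to स
    "ण": "न",              # retroflex nasal collapsed to न
}


def normalize_devanagari(text: str) -> str:
    return "".join(CHAR_GROUPS.get(ch, ch) for ch in text)


print(normalize_devanagari("भाषण"))  # -> "भासन"
```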
... A TTS system is generally a pipeline of three components, as shown in Fig. 1: front-end, acoustic model, and vocoder. The front-end plays a crucial role in a TTS system to provide the required phoneme-relevant linguistic knowledge [34]-[36]. We propose an accented TTS system that consists of an accented front-end and an accented acoustic model. ...
Preprint
This paper presents an accented text-to-speech (TTS) synthesis framework with limited training data. We study two aspects concerning accent rendering: phonetic (phoneme difference) and prosodic (pitch pattern and phoneme duration) variations. The proposed accented TTS framework consists of two models: an accented front-end for grapheme-to-phoneme (G2P) conversion and an accented acoustic model with integrated pitch and duration predictors for phoneme-to-Mel-spectrogram prediction. The accented front-end directly models the phonetic variation, while the accented acoustic model explicitly controls the prosodic variation. Specifically, both models are first pre-trained on a large amount of data, then only the accent-related layers are fine-tuned on a limited amount of data for the target accent. In the experiments, speech data of three English accents, i.e., General American English, Irish English, and British English Received Pronunciation, are used for pre-training. The pre-trained models are then fine-tuned with Scottish and General Australian English accents, respectively. Both objective and subjective evaluation results show that the accented TTS front-end fine-tuned with a small accented phonetic lexicon (5k words) effectively handles the phonetic variation of accents, while the accented TTS acoustic model fine-tuned with a limited amount of accented speech data (approximately 3 minutes) effectively improves the prosodic rendering including pitch and duration. The overall accent modeling contributes to improved speech quality and accent similarity.
... In [11,12], language-independent Phonetic PosteriorGram (PPG) features of ASR models are used as input for cross-lingual TTS models. [13] further proposes a mixed-lingual grapheme-to-phoneme (G2P) frontend to improve the pronunciation of mixed-lingual sentences in cross-lingual TTS systems. ...
... Speech synthesis, as an emerging technology, was applied in this study to help learners improve learning efficiency and break through the bottleneck of non-standard pronunciation among Chinese ESL teachers [35]-[37]. Speech synthesis refers to converting input text into a form that can be voiced by a machine [38], [39]. According to Sefara [41], factors such as stress, phonetics, intonation, accent, age, and motivation can cause inappropriate or incorrect pronunciation by non-natives. ...
... The speech was recognized as natural, pleasant, and understandable. A similar study conducted by Bansal et al. [38] showed a rise of 28% in pronunciation accuracy and a 0.9 gain in mean opinion score (MOS). Werner and Hoffmann [46] evaluated the quality of different approaches and obtained a high MOS for both synthetic and natural speech samples. ...
Article
Full-text available
As the most important foreign language in China, English poses a great challenge for teachers trying to improve the performance of ESL (English as a second language) learners through in-class instruction. Blended, explicit, and implicit instruction are three widely used approaches in English classroom teaching, and it can be difficult to choose which one should be used to help English language learners. With audio synthesis technology applied as a teaching enhancement, this study examines the effects of the three approaches on English pronunciation teaching in terms of the performance and satisfaction of English language learners. 120 English learners in China were equally divided into three groups, which were taught with blended, explicit, and implicit instruction, respectively. Based on data collected by test and questionnaire at pre-test, in-test, and post-test, the results show that blended instruction performs best in terms of performance improvement and class satisfaction. These findings indicate the great potential of blended instruction in English language teaching, and more investment should be made to promote its application and help Chinese learners better acquire this important language.
... G2P conversion has been studied in various ways, including rule-based, dictionary-based, statistical (Deri and Knight, 2016), and neural network-based methods (Yolchuyeva et al., 2021; Sun et al., 2019a; Choi et al., 2021). Currently, most G2P research is monolingual, although bilingual and multilingual G2P research has recently been actively pursued (Yu et al., 2020; Bansal et al., 2020; Gautam et al., 2021). Most of the proposed high-performing models are based on autoregressive transformers (A. ...
Article
While the performance of cross-lingual TTS based on monolingual corpora has been significantly improved recently, generating cross-lingual speech still suffers from the foreign accent problem, leading to limited naturalness. Besides, current cross-lingual methods ignore modeling emotion, which is indispensable paralinguistic information in speech delivery. In this paper, we propose DiCLET-TTS, a Diffusion model based Cross-Lingual Emotion Transfer method that can transfer emotion from a source speaker to the intra- and cross-lingual target speakers. Specifically, to relieve the foreign accent problem while improving the emotion expressiveness, the terminal distribution of the forward diffusion process is parameterized into a speaker-irrelevant but emotion-related linguistic prior by a prior text encoder with the emotion embedding as a condition. To address the weaker emotional expressiveness problem caused by speaker disentanglement in emotion embedding, a novel orthogonal projection based emotion disentangling module (OP-EDM) is proposed to learn the speaker-irrelevant but emotion-discriminative embedding. Moreover, a condition-enhanced DPM decoder is introduced to strengthen the modeling ability of the speaker and the emotion in the reverse diffusion process to further improve emotion expressiveness in speech delivery. Cross-lingual emotion transfer experiments show the superiority of DiCLET-TTS over various competitive models and the good design of OP-EDM in learning speaker-irrelevant but emotion-discriminative embedding.