Xinfa Zhu's research while affiliated with Northwestern Polytechnical University and other places

What is this page?


This page lists the scientific contributions of an author who either does not have a ResearchGate profile or has not yet added these contributions to their profile.

It was automatically created by ResearchGate to provide a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.


Publications (17)


Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy
  • Preprint

June 2024 · 3 Reads

Linhan Ma · Xinfa Zhu · Yuanjun Lv · [...]
Zero-shot voice conversion (VC) aims to transform source speech into an arbitrary unseen target voice while keeping the linguistic content unchanged. Recent VC methods have made significant progress, but semantic losses in the decoupling process as well as training-inference mismatch still hinder conversion performance. In this paper, we propose Vec-Tok-VC+, a novel prompt-based zero-shot VC model improved from Vec-Tok Codec, which achieves voice conversion given only a 3-second target speaker prompt. We design a residual-enhanced K-Means decoupler to enhance semantic content extraction with a two-layer clustering process. Besides, we employ teacher-guided refinement to simulate the conversion process during training, eliminating the training-inference mismatch and forming a dual-mode training strategy. Furthermore, we design a multi-codebook progressive loss function to constrain the layer-wise output of the model from coarse to fine, improving speaker similarity and content accuracy. Objective and subjective evaluations demonstrate that Vec-Tok-VC+ outperforms strong baselines in naturalness, intelligibility, and speaker similarity.
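As a rough illustration of the residual-enhanced two-layer clustering idea described above, here is a minimal sketch assuming frame-level self-supervised speech features as input; the feature source, cluster counts, and the residual_two_layer_kmeans helper are illustrative placeholders, not the authors' configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

def residual_two_layer_kmeans(features, k1=32, k2=32, seed=0):
    """Two-layer residual K-Means decoupling (illustrative sketch).

    features: (T, D) array of frame-level speech features, e.g. self-supervised
    representations. Layer 1 captures coarse semantic content; layer 2 clusters
    the residual between the features and their layer-1 centroids.
    """
    km1 = KMeans(n_clusters=k1, random_state=seed, n_init=10).fit(features)
    tokens1 = km1.labels_                                 # coarse semantic tokens
    residual = features - km1.cluster_centers_[tokens1]   # detail missed by layer 1

    km2 = KMeans(n_clusters=k2, random_state=seed, n_init=10).fit(residual)
    tokens2 = km2.labels_                                 # residual detail tokens

    # Quantized content representation = sum of the two codeword lookups.
    quantized = km1.cluster_centers_[tokens1] + km2.cluster_centers_[tokens2]
    return tokens1, tokens2, quantized

# Toy usage with random features standing in for real speech representations.
feats = np.random.randn(500, 1024).astype(np.float32)
t1, t2, q = residual_two_layer_kmeans(feats)
print(t1.shape, t2.shape, q.shape)   # (500,) (500,) (500, 1024)
```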


Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation

June 2024 · 1 Read

The multi-codebook speech codec enables the application of large language models (LLMs) in TTS but bottlenecks efficiency and robustness due to multi-sequence prediction. To avoid this obstacle, we propose Single-Codec, a single-codebook, single-sequence codec, which employs a disentangled VQ-VAE to decouple speech into a time-invariant embedding and a phonetically rich discrete sequence. Furthermore, the encoder is enhanced with 1) contextual modeling with a BLSTM module to exploit the temporal information, 2) a hybrid sampling module to alleviate distortion from upsampling and downsampling, and 3) a resampling module to encourage discrete units to carry more phonetic information. Compared with multi-codebook codecs, e.g., EnCodec and TiCodec, Single-Codec demonstrates higher reconstruction quality with a lower bandwidth of only 304 bps. The effectiveness of Single-Codec is further validated by LLM-TTS experiments, showing improved naturalness and intelligibility.
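As a rough sketch of the single-codebook idea, the PyTorch snippet below implements a vector quantizer with a straight-through estimator, the basic building block such a codec relies on; the codebook size, dimensions, and the SingleCodebookVQ name are assumptions for illustration, and the disentangled VQ-VAE, BLSTM context module, and hybrid/resampling modules from the abstract are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleCodebookVQ(nn.Module):
    """Single-codebook vector quantizer with a straight-through estimator.

    Maps encoder frames (B, T, D) to one discrete index per frame, so a
    downstream LLM-TTS only has to predict a single token sequence.
    Sizes are illustrative, not the paper's configuration.
    """
    def __init__(self, codebook_size=1024, dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / codebook_size, 1.0 / codebook_size)
        self.beta = beta

    def forward(self, z):                        # z: (B, T, D) continuous encoder output
        flat = z.reshape(-1, z.shape[-1])        # (B*T, D)
        # Nearest codeword by squared Euclidean distance.
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        idx = dist.argmin(dim=1)                 # (B*T,) -> single token sequence
        q = self.codebook(idx).view_as(z)        # quantized vectors
        # Codebook + commitment losses, then straight-through gradient copy.
        loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()
        return q, idx.view(z.shape[:-1]), loss

vq = SingleCodebookVQ()
z = torch.randn(2, 50, 256)
q, tokens, vq_loss = vq(z)
print(q.shape, tokens.shape, vq_loss.item())    # (2, 50, 256), (2, 50), scalar
```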


Text-aware and Context-aware Expressive Audiobook Speech Synthesis

June 2024 · 7 Reads

Recent advances in text-to-speech have significantly improved the expressiveness of synthetic speech. However, a major challenge remains in generating speech that captures the diverse styles exhibited by professional narrators in audiobooks without relying on manually labeled data or reference speech. To address this problem, we propose a text-aware and context-aware (TACA) style modeling approach for expressive audiobook speech synthesis. We first establish a text-aware style space that covers diverse styles via contrastive learning with the supervision of speech style. Meanwhile, we adopt a context encoder to incorporate cross-sentence information and the style embedding obtained from text. Finally, we integrate the context encoder into two typical TTS models, a VITS-based TTS and a language-model-based TTS. Experimental results demonstrate that our proposed approach can effectively capture diverse styles and coherent prosody, and consequently improves naturalness and expressiveness in audiobook speech synthesis.
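The sketch below shows one way the contrastive text-speech style alignment could look: a symmetric InfoNCE objective that pulls together the text-derived and speech-derived style embeddings of the same sentence and pushes apart those of other sentences in the batch; the embedding size, temperature, and function name are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def text_speech_style_contrastive_loss(text_emb, speech_emb, temperature=0.07):
    """Symmetric InfoNCE between text-derived and speech-derived style embeddings.

    text_emb, speech_emb: (B, D) embeddings for the same batch of sentences;
    matched pairs share the row index, all other rows act as negatives.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    speech_emb = F.normalize(speech_emb, dim=-1)
    logits = text_emb @ speech_emb.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    loss_t2s = F.cross_entropy(logits, targets)         # text -> speech direction
    loss_s2t = F.cross_entropy(logits.t(), targets)     # speech -> text direction
    return 0.5 * (loss_t2s + loss_s2t)

# Toy usage: 8 sentences, 128-dim style embeddings from hypothetical encoders.
text_style = torch.randn(8, 128)
speech_style = torch.randn(8, 128)
print(text_speech_style_contrastive_loss(text_style, speech_style))
```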




Accent-VITS: Accent Transfer for End-to-End TTS

February 2024 · 31 Reads

Accent transfer aims to transfer an accent from a source speaker to synthetic speech in the target speaker's voice. The main challenge is how to effectively disentangle speaker timbre and accent, which are entangled in speech. This paper presents a VITS-based [7] end-to-end accent transfer model named Accent-VITS. Based on the main structure of VITS, Accent-VITS makes substantial improvements to enable effective and stable accent transfer. We leverage a hierarchical CVAE structure to model accent pronunciation information and acoustic features, respectively, using bottleneck features and mel spectrograms as constraints. Moreover, the text-to-wave mapping in VITS is decomposed into text-to-accent and accent-to-wave mappings in Accent-VITS. In this way, the disentanglement of accent and speaker timbre becomes more stable and effective. Experiments on multi-accent and Mandarin datasets show that Accent-VITS achieves higher speaker similarity, accent similarity, and speech naturalness compared with a strong baseline (demos: https://anonymous-accentvits.github.io/AccentVITS/).
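The text-to-accent and accent-to-wave decomposition can be pictured as two chained modules with bottleneck (BN) features as the intermediate target, as in the toy skeleton below; it uses plain GRUs and placeholder sizes purely for illustration and does not reproduce the hierarchical CVAE or VITS components of Accent-VITS.

```python
import torch
import torch.nn as nn

class TwoStageAccentTransfer(nn.Module):
    """Skeleton of a text-to-accent / accent-to-wave decomposition.

    Stage 1 predicts accent-bearing bottleneck (BN) features from text plus an
    accent ID; stage 2 generates acoustic features from the BN features plus a
    speaker ID. All layer types and sizes are illustrative placeholders.
    """
    def __init__(self, n_phones=100, n_accents=4, n_speakers=10,
                 bn_dim=256, mel_dim=80, hidden=256):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, hidden)
        self.accent_emb = nn.Embedding(n_accents, hidden)
        self.speaker_emb = nn.Embedding(n_speakers, hidden)
        # Stage 1: text (+ accent) -> bottleneck features (accent pronunciation).
        self.text_to_accent = nn.GRU(hidden, bn_dim, batch_first=True)
        # Stage 2: bottleneck features (+ speaker) -> acoustic features.
        self.accent_to_wave = nn.GRU(bn_dim + hidden, mel_dim, batch_first=True)

    def forward(self, phones, accent_id, speaker_id):
        h = self.phone_emb(phones) + self.accent_emb(accent_id).unsqueeze(1)
        bn_pred, _ = self.text_to_accent(h)                      # (B, T, bn_dim)
        spk = self.speaker_emb(speaker_id).unsqueeze(1).expand(-1, bn_pred.size(1), -1)
        mel_pred, _ = self.accent_to_wave(torch.cat([bn_pred, spk], dim=-1))
        return bn_pred, mel_pred  # BN features supervise stage 1, mels supervise stage 2

model = TwoStageAccentTransfer()
phones = torch.randint(0, 100, (2, 20))
bn, mel = model(phones, torch.tensor([1, 3]), torch.tensor([0, 5]))
print(bn.shape, mel.shape)   # torch.Size([2, 20, 256]) torch.Size([2, 20, 80])
```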


METTS: Multilingual Emotional Text-to-Speech by Cross-Speaker and Cross-Lingual Emotion Transfer

January 2024 · 14 Reads · 2 Citations · IEEE/ACM Transactions on Audio, Speech, and Language Processing

Previous multilingual text-to-speech (TTS) approaches have considered leveraging monolingual speaker data to enable cross-lingual speech synthesis. However, such data-efficient approaches have ignored the emotional aspects of speech due to the challenges of cross-speaker, cross-lingual emotion transfer: the heavy entanglement of speaker timbre, emotion, and language factors in the speech signal causes a system to produce cross-lingual synthetic speech with an undesired foreign accent and weak emotion expressiveness. This paper proposes a Multilingual Emotional TTS (METTS) model to mitigate these problems, realizing both cross-speaker and cross-lingual emotion transfer. Specifically, METTS takes DelightfulTTS as the backbone model and proposes the following designs. First, to alleviate the foreign accent problem, METTS introduces multi-scale emotion modeling to disentangle speech prosody into coarse-grained and fine-grained scales, producing language-agnostic and language-specific emotion representations, respectively. Second, as a pre-processing step, formant-shift-based information perturbation is applied to the reference signal for better disentanglement of speaker timbre in the speech. Third, a vector-quantization-based emotion matcher is designed for reference selection, leading to decent naturalness and emotion diversity in cross-lingual synthetic speech. Experiments demonstrate the effectiveness of these designs in METTS.
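As a crude stand-in for the formant-shift-based information perturbation mentioned above, the sketch below warps the magnitude spectrogram of a reference waveform along the frequency axis and resynthesizes it with Griffin-Lim; the warp ratio, STFT settings, and the full-spectrum warp (which also moves pitch harmonics) are assumptions, not the paper's signal processing chain.

```python
import numpy as np
import librosa

def spectral_warp_perturb(wav, ratio=1.15, n_fft=1024, hop=256):
    """Rough formant-shift-style perturbation via frequency-axis warping.

    Warps the magnitude spectrogram by `ratio` (>1 shifts spectral content up,
    <1 shifts it down) and resynthesizes with Griffin-Lim. A crude
    approximation intended only to illustrate information perturbation.
    """
    spec = librosa.stft(wav, n_fft=n_fft, hop_length=hop)
    mag = np.abs(spec)
    n_bins = mag.shape[0]
    src_bins = np.arange(n_bins) / ratio          # where each output bin samples from
    warped = np.stack(
        [np.interp(src_bins, np.arange(n_bins), mag[:, t]) for t in range(mag.shape[1])],
        axis=1,
    )
    # Griffin-Lim resynthesis; the original phase is discarded, which is
    # acceptable for a perturbed reference used only for style extraction.
    return librosa.griffinlim(warped, n_fft=n_fft, hop_length=hop)

# Toy usage with a bundled librosa example clip; any speech file works here.
wav, sr = librosa.load(librosa.ex("trumpet"), sr=16000)
perturbed = spectral_warp_perturb(wav, ratio=1.15)
print(wav.shape, perturbed.shape)
```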




DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech -- A Study between English and Mandarin

September 2023 · 22 Reads

While the performance of cross-lingual TTS based on monolingual corpora has been significantly improved recently, generating cross-lingual speech still suffers from the foreign accent problem, leading to limited naturalness. Besides, current cross-lingual methods ignore modeling emotion, which is indispensable paralinguistic information in speech delivery. In this paper, we propose DiCLET-TTS, a Diffusion model based Cross-Lingual Emotion Transfer method that can transfer emotion from a source speaker to the intra- and cross-lingual target speakers. Specifically, to relieve the foreign accent problem while improving the emotion expressiveness, the terminal distribution of the forward diffusion process is parameterized into a speaker-irrelevant but emotion-related linguistic prior by a prior text encoder with the emotion embedding as a condition. To address the weaker emotional expressiveness problem caused by speaker disentanglement in emotion embedding, a novel orthogonal projection based emotion disentangling module (OP-EDM) is proposed to learn the speaker-irrelevant but emotion-discriminative embedding. Moreover, a condition-enhanced DPM decoder is introduced to strengthen the modeling ability of the speaker and the emotion in the reverse diffusion process to further improve emotion expressiveness in speech delivery. Cross-lingual emotion transfer experiments show the superiority of DiCLET-TTS over various competitive models and the good design of OP-EDM in learning speaker-irrelevant but emotion-discriminative embedding.
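The orthogonal-projection idea behind OP-EDM can be illustrated by removing the speaker direction from an emotion embedding, i.e. computing e - (⟨e, s⟩ / ⟨s, s⟩) s; the tensor shapes and function name below are assumptions, and the actual module additionally involves learned encoders and training objectives not shown here.

```python
import torch

def orthogonal_emotion_projection(emotion_emb, speaker_emb, eps=1e-8):
    """Project out the speaker direction from each emotion embedding.

    For each pair (e, s) returns e - (<e, s> / <s, s>) * s, the component of
    the emotion embedding orthogonal to the speaker embedding. Shapes: (B, D).
    """
    dot = (emotion_emb * speaker_emb).sum(dim=-1, keepdim=True)      # <e, s>
    norm_sq = (speaker_emb * speaker_emb).sum(dim=-1, keepdim=True)  # <s, s>
    return emotion_emb - dot / (norm_sq + eps) * speaker_emb

e = torch.randn(4, 192)   # emotion embeddings (illustrative batch and size)
s = torch.randn(4, 192)   # speaker embeddings from the same utterances
e_orth = orthogonal_emotion_projection(e, s)
# The result has (near) zero component along each speaker embedding.
print((e_orth * s).sum(dim=-1))
```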


Citations (6)


... We ideally aim for audio tokens that preserve crucial information of the original waveform, including phonetic and linguistic content, speaker identities, emotions, and other paralinguistic cues. However, despite the growing trend toward audio tokens, there is still a lack of standardized evaluation benchmarks, with different studies employing varied experimental settings [24,25,26,27,28]. Without a consistent framework for measuring and comparing performance, it becomes challenging to determine which audio tokens perform optimally across various tasks. ...

Reference:

DASB -- Discrete Audio and Speech Benchmark
SELM: Speech Enhancement using Discrete Tokens and Language Models
  • Citing Conference Paper
  • April 2024

... In recent years, deep-learning-based speech synthesis methods have achieved high quality and naturalness [4,5,6]. On expressiveness, state-of-the-art methods show good emotion rendering [7,8,9] and style imitation [10,11]. However, compared to humans, especially professional voice actors, their expressiveness and controllability are still very limited. ...

METTS: Multilingual Emotional Text-to-Speech by Cross-Speaker and Cross-Lingual Emotion Transfer
  • Citing Article
  • January 2024

IEEE/ACM Transactions on Audio, Speech, and Language Processing

... Particularly, audiobook speech synthesis aims to synthesize expressive long-form speech from literary books, achieving efficient and cost-saving automated audio-content production. However, audiobook synthesis is more challenging due to the rich speaking styles performed by professional narrators, context-aware expressiveness, and long-form prosody coherence [4]. ...

HIGNN-TTS: Hierarchical Prosody Modeling With Graph Neural Networks for Expressive Long-Form TTS
  • Citing Conference Paper
  • December 2023

... Due to the lack of emotional information, despite the advancements in sequence-to-sequence speech synthesis models [11], TTS models often produce high-quality speech that lacks satisfactory expressiveness. It is imperative to resolve this issue, considering the broad range of applications for emotional speech synthesis, particularly in fields such as education and voice assistants. ...

DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech — A Study between English and Mandarin
  • Citing Article
  • January 2023

IEEE/ACM Transactions on Audio, Speech, and Language Processing

... Furthermore, CLAM [11] proposes to select multiple references focusing on the style-related information in the text. Other works have focused on decoupling a style representation from speech, achieving style-controllable speech synthesis [12,13,14]. However, the styles derived from either labeled tags or reference speech are still limited, making it difficult to cover the wide range of rich styles in audiobooks. ...

Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling
  • Citing Conference Paper
  • June 2023

... For instance, in [31], the features extracted from a pre-trained ASR model are used to represent the speaking style. Lei et al. [56] model speaking style on a perturbed waveform in which the speaker identity has been changed. ...

Cross-Speaker Emotion Transfer Through Information Perturbation in Emotional Speech Synthesis
  • Citing Article
  • January 2022

IEEE Signal Processing Letters