An Overview of Voice Conversion and
its Challenges: From Statistical Modeling to
Deep Learning
Berrak Sisman, Member, IEEE, Junichi Yamagishi, Senior Member, IEEE, Simon King, Fellow, IEEE,
and Haizhou Li, Fellow, IEEE
Abstract—Speaker identity is one of the important charac-
teristics of human speech. In voice conversion, we change the
speaker identity from one to another, while keeping the lin-
guistic content unchanged. Voice conversion involves multiple
speech processing techniques, such as speech analysis, spectral
conversion, prosody conversion, speaker characterization, and
vocoding. With the recent advances in theory and practice, we
are now able to produce human-like voice quality with high
speaker similarity. In this paper, we provide a comprehensive
overview of the state-of-the-art of voice conversion techniques
and their performance evaluation methods from the statistical
approaches to deep learning, and discuss their promise and
limitations. We will also report the recent Voice Conversion
Challenges (VCC), the performance of the current state of
technology, and provide a summary of the available resources
for voice conversion research.
Index Terms—Voice conversion, speech analysis, speaker
characterization, vocoding, voice conversion evaluation, voice
conversion challenges.
I. INTRODUCTION
Voice conversion (VC) is a significant aspect of artifi-
cial intelligence. It is the study of how to convert one's
voice to sound like that of another without changing the
linguistic content. Voice conversion belongs to a general
technical field of speech synthesis, which converts text to
speech or changes the properties of speech, for example,
voice identity, emotion, and accents. Stewart, a pioneer in
speech synthesis, commented in 1922 [1] that “the really difficult
problem involved in the artificial production of speech-
sounds is not the making of a device which shall produce
speech, but in the manipulation of the apparatus”. As voice
conversion is focused on the manipulation of voice identity
in speech, it represents one of the challenging research
problems in speech processing.
There has been a continuous effort in the quest for effec-
tive manipulation of speech properties since the debut of
computer-based speech synthesis in the 1950s. The rapid
development of digital signal processing in the 1970s greatly
Berrak Sisman is with the Information Systems Technology and Design
(ISTD) Pillar of Singapore University of Technology and Design (SUTD),
Singapore.
Junichi Yamagishi is with National Institute of Informatics, Japan and
University of Edinburgh, United Kingdom.
Simon King is with the University of Edinburgh, United Kingdom.
Haizhou Li is with the Department of Electrical and Computer Engi-
neering, National University of Singapore.
facilitated the control of the parameters for speech manip-
ulation. While the original motivation of voice conversion
could be simply novelty and curiosity, the technological
advancements from statistical modeling to deep learning
have made a major impact on many real-life applications
that benefit consumers, such as personalized speech
synthesis [2], [3], communication aids for the speech-
impaired [4], speaker de-identification [5], voice mimicry [6]
and disguise [7], and voice dubbing for movies.
In general, a speaker can be characterized by three factors:
1) linguistic factors that are reflected in sentence
structure, lexical choice, and idiolect; 2) supra-segmental
factors such as the prosodic characteristics of a speech
signal, and 3) segmental factors that are related to short
term features, such as spectrum and formants. When the
linguistic content is fixed, the supra-segmental and the seg-
mental factors are the relevant factors concerning speaker
individuality. An effective voice conversion technique is
expected to convert both the supra-segmental and the seg-
mental factors. Despite much progress, voice conversion
is still far from perfect. In this paper, we celebrate the
technological advances, at the same time we expose their
limitations. We will discuss the state-of-the-art technology
from historical and technological perspectives.
A typical voice conversion pipeline includes speech
analysis, mapping, and reconstruction modules, as illus-
trated in Figure 1, which is referred to as the analysis-mapping-
reconstruction pipeline. The speech analyzer decomposes
the speech signals of a source speaker into features that
represent supra-segmental and segmental information, and
the mapping module changes them towards the target
speaker, and finally the reconstruction module re-synthesizes
time-domain speech signals. The mapping module has
taken centre stage in many of the studies. These tech-
niques can be categorized in different ways, for example,
based on the use of training data - parallel vs non-parallel,
the type of statistical modeling technique - parametric vs
non-parametric, the scope of optimization - frame level vs
utterance level, and the workflow of conversion - direct
mapping vs inter-lingual. Let’s first give an account from
the perspective of the use of training data.
The early studies of voice conversion were focused
on spectrum mapping using parallel training data, where
speech of the same linguistic content is available from
both the source and target speaker, for example, vector
quantization (VQ) [8] and fuzzy vector quantization [9].
With parallel data, one can align the two utterances using
Dynamic Time Warping [10]. The statistical parametric ap-
proaches can benefit from more training data for improved
performance; examples include the Gaussian mixture model
[11]–[13], partial least squares regression [14] and dynamic
kernel partial least squares regression (DKPLS) [15].
One of the successful statistical non-parametric tech-
niques is based on non-negative matrix factorization (NMF)
[16] and it is known as the exemplar-based sparse repre-
sentation technique [17]–[20]. It requires a smaller amount
of training data than the parametric techniques, and ad-
dresses well the over-smoothing problem. The family of
sparse representation techniques includes phonetic sparse
representation and group sparsity implementations [21], [22],
which greatly improved the voice quality on small parallel
training datasets.
The studies on voice conversion towards non-parallel
training data [23]–[28] open up the opportunities for new
applications. The challenge is how to establish the mapping
between non-parallel source and target utterances. The
INCA alignment technique by Erro et al. [27] represents
one of the solutions to the non-parallel data alignment
problem [29]. With the alignment techniques, one is able
to extend the voice conversion techniques from parallel
data to non-parallel data, such as the extension to DKPLS
[30] and the speaker model alignment method [31]. The Phonetic
PosteriorGram (PPG)-based approach [32] represents an-
other direction of research towards non-parallel training
data. While the alignment technique doesn't use external
resources, the PPG-based approach makes use of an auto-
matic speech recognizer to generate an intermediate phonetic
representation [33], [34] as the interlingua between the
speakers. Successful applications include Phonetic Sparse
Representation [22].
Wu and Li [6], and Mohammadi and Kain [35] provided
an overview of voice conversion systems from the per-
spective of time alignment of speech features followed by
feature mapping, which represents the statistical modeling
school of thought. The advent of deep learning techniques
represents an important technology milestone in the voice
conversion research [36]. It has not only greatly advanced
the state-of-the-art, but also transformed the way we for-
mulate the voice conversion research problems. It also
opens up a new direction of research beyond the parallel
and non-parallel data paradigm. Nonetheless, the studies
on statistical modeling approaches have provided profound
insights into many aspects of the research problems that
serve as the foundation work of today’s deep learning
methodology. In this paper, we will give an overview of voice
conversion research by providing a perspective that reveals
the underlying design principles from statistical modeling
to deep learning.
Deep learning’s contributions to voice conversion can be
summarized in three areas. Firstly, it allows the mapping
module to learn from a large amount of speech data,
therefore tremendously improving voice quality and simi-
larity to the target speaker. With neural networks, we see the
mapping module as a nonlinear transformation function
[37], that is trained from data [38], [39]. LSTM represents a
successful implementation with parallel training data [40].
Deep learning has made a great impact on non-parallel data
techniques. The joint use of DBLSTM and i-vectors [41], the
KL-divergence and DNN-based approach [42], variational auto-
encoders [43], average modeling [44] and DBLSTM-based
recurrent neural networks [32], [45] bring the voice quality
to a new height. More recently, Generative Adversarial
Networks such as VAW-GAN [46], CycleGAN [47]–[49], and
StarGAN [50] further advance the state-of-the-art.
Secondly, deep learning has created a profound impact
on vocoding technology. Speech analysis and reconstruc-
tion modules are typically implemented using a traditional
parametric vocoder [11]–[13], [51]. The parameters of such
vocoders are manually tuned according to some over-
simplified assumptions in signal processing. As a result,
the parametric vocoders offer a suboptimal solution. A neural
vocoder is a neural network that learns to reconstruct an
audio waveform from acoustic features [52]. For the first
time, the vocoder becomes trainable and data-driven.
WaveNet vocoder [53] represents one of the popular neural
vocoders, that directly estimates waveform samples from
the input feature vectors. It has been studied intensively,
for example, speaker dependent and independent WaveNet
vocoder [54], [55], quasi-periodic WaveNet vocoder [56],
[57], adaptive WaveNet vocoder with GANs [58], factorized
WaveNet vocoder [59], and refined WaveNet vocoder with
VAEs [60] that are known for their natural sounding voice
quality. The WaveNet vocoder is also widely adopted in tradi-
tional voice conversion pipelines, such as GMM [54] and sparse
representation [61], [62] systems. Other successful neural
vocoders include the WaveRNN vocoder [63] and WaveGlow [64],
which are excellent vocoders in their own right.
Thirdly, deep learning represents a departure from the
traditional analysis-mapping-reconstruction pipeline. All
the above techniques largely follow the voice conversion
pipeline as in Figure 1. As the neural vocoder is trainable, it
can be trained jointly with the mapping module [58] and even
with the analysis module to become an end-to-end solution [53].
Voice conversion research used to be a niche area in
speech synthesis. However, it has become a major topic
in recent years. In the 45th International Conference on
Acoustics, Speech, and Signal Processing (ICASSP 2020),
voice conversion papers represent more than one-third of
the papers under the speech synthesis category. The growth
of the research community was accelerated by collaborative
activities across academia and industry, such as the Voice
Conversion Challenge (VCC) 2016, which was first launched
[65]–[67] at INTERSPEECH 2016. VCC 2016 is focused on the
most basic voice conversion task, that is, voice conversion
with parallel training data recorded in an acoustic studio. It
establishes the evaluation methodology and protocol for
performance benchmarking, which are widely adopted in the
community. VCC 2018 [68]–[70] proposes a non-parallel
training data challenge, and also connects voice conversion
with anti-spoofing studies in speaker verification. VCC 2020
puts forward a cross-lingual voice conversion challenge for
the first time. We will provide an overview of the series
of challenges and the publicly available resources in this
paper.

Fig. 1: The typical flow of a voice conversion system. The pink box represents the training of the mapping function, while
the blue box applies the mapping function at run-time, in a 3-step pipeline process Y = (R \circ F \circ A)(X).
This paper is organized as follows: In Section II, we
present the typical flow of voice conversion that includes
feature extraction, feature mapping and waveform gener-
ation. In Section III, we study the statistical modeling for
voice conversion with parallel training data. In Section IV,
we study statistical modeling for voice conversion without
parallel training data. In Section V, we study the deep learn-
ing approaches for voice conversion with parallel training
data, and beyond parallel training data. In Section VI, we
explain the evaluation techniques for voice conversion. In
Section VII and VIII, we summarize the series of voice
conversion challenges, and publicly available research re-
sources for voice conversion. We conclude in Section IX.
II. TYPICAL FLOW OF VOICE CONVERSION
The goal of voice conversion is to modify a source
speaker’s voice to sound as if it is produced by a target
speaker. In other words, a voice conversion system only
modifies the speaker-dependent characteristics of speech,
such as formants, fundamental frequency (F0), intonation,
intensity and duration, while carrying over the speaker-
independent speech content.
The core module of a voice conversion system performs
the conversion function. Let's denote the source and target
speech signals as X and Y, respectively. As will be discussed
later, voice conversion is typically applied to some inter-
mediate representation of speech, or speech feature, that
characterizes a speech frame. Let's denote the source and
target speech features as x and y. The conversion function
can be formulated as follows,

y = F(x)    (1)

where F(·) is also called the mapping function in the rest of this
paper. As illustrated in Figure 1, a typical voice conversion
framework is implemented in three steps: 1) speech analy-
sis, 2) feature mapping, and 3) speech reconstruction, which
we call the analysis-mapping-reconstruction pipeline. We
discuss each step in detail next.
A. Speech Analysis and Reconstruction
The speech analysis and reconstruction are two cru-
cial processes in the 3-step pipeline. The goal of speech
analysis is to decompose speech signals into some form
of intermediate representation for effective manipulation
or modification with respect to the acoustic properties of
speech. There have been many useful intermediate repre-
sentation techniques that were initially studied for speech
communication and speech synthesis. They come in handy
for voice conversion. In general, the techniques can be
categorized into model-based representations, and signal-
based representations.
In model-based representation, we assume that the speech
signal is generated according to an underlying physical
model, such as the source-filter model, and express a frame of
speech signal as a set of model parameters. By modifying
the parameters, we manipulate the input speech. In signal-
based representation, we don’t assume any models, but
rather represent speech as a composition of controllable
elements in the time domain or frequency domain. Let's denote
the intermediate representation for the source speaker as x;
speech analysis can then be described by a function,

x = A(X)    (2)
Speech reconstruction can be seen as an inverse function
of the speech analysis, which operates on the modified
parameters and generates an audible speech signal. It works
with speech analysis in tandem. For example, a vocoder [51]
is used to express a speech frame with a set of controllable
parameters that can be converted back into a speech
waveform. The Griffin-Lim algorithm is used to reconstruct a
speech signal from a modified short-time Fourier transform
after amplitude modification [71]. As the output speech
quality is affected by the speech reconstruction process,
speech reconstruction is also one of the important topics
in voice conversion research. Let's denote the modified
intermediate representation and the reconstructed speech
signal for the target speaker as y and Y = R(y); voice conversion
can then be described by a composition of three functions,

Y = (R \circ F \circ A)(X) = C(X)    (3)
which represents the typical flow of a voice conversion system
as a 3-step pipeline. As the mapping is applied frame-by-
frame, the number of converted speech features y is the
same as that of the source speech features x if the speech
duration is not modified in the process.
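The composition in Eq. (3) can be sketched directly as function composition. The minimal Python sketch below is illustrative only; the names analyze, map_features and reconstruct are hypothetical placeholders for the analysis, mapping and reconstruction modules described above.

def convert(X, analyze, map_features, reconstruct):
    """Analysis-mapping-reconstruction pipeline: Y = (R o F o A)(X)."""
    x = analyze(X)            # A: waveform -> frame-level intermediate representation
    y = map_features(x)       # F: source features -> target features (frame-by-frame)
    Y = reconstruct(y)        # R: modified features -> audible waveform
    return Y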
While speech analysis and reconstruction make voice
conversion possible, just like other signal processing
techniques, they inevitably also introduce artifacts. Many
studies have been devoted to minimizing such artifacts. We next
discuss the most commonly used speech analysis and
reconstruction techniques in voice conversion.
1) Signal-based Representation: Pitch Synchronous Over-
Lap and Add (PSOLA) is an example of signal-based rep-
resentation techniques. It decomposes a speech signal into
overlapping speech segments [72], each of which represents
one of the successive pitch periods of the speech signal. By
overlap-and-adding these speech segments with different
pitch periods, we can reconstruct the speech signal with a dif-
ferent intonation. As PSOLA operates directly on the time-
domain speech signal [72], the analysis and reconstruction
do not introduce significant artifacts. While PSOLA tech-
nique is effective for modification of fundamental frequency
of speech signals, it suffers from several inherent limitations
[73], [74]. For example, an unvoiced speech signal is not
periodic, and the manipulation of the time-domain signal is not
straightforward.
Harmonic plus Noise Model (HNM) represents another
signal-based representation approach. It works under the
assumption that a speech signal can be represented as
a harmonic component plus a noise component that is
delimited by the so-called maximum voiced frequency
[75]. The harmonic component is modeled as the sum of
harmonic sinusoids up to the maximum voiced frequency,
while the noise component is modeled as Gaussian noise
filtered by a time-varying autoregressive filter. As HNM
decomposition is represented by some controllable param-
eters, it allows for easy modification of speech [76], [77].
2) Model-based Representation: The model-based tech-
nique assumes that the input signal can be mathematically
represented by a model whose parameters vary with time.
A typical example is the source-filter model that represents
a speech signal as the outcome of an excitation of the
larynx (source) modulated by a transfer (filter) function
determined by the shape of the supralaryngeal vocal tract. A
vocoder, a short form of voice coder, was initially developed
to minimize the amount of data that are transmitted for
voice communication. It encodes speech into slowly chang-
ing control parameters, such as linear predictive coding
and mel-log spectrum approximation [78], that describe the
filter, and re-synthesizes the speech signal with the source
information at the receiving end. In voice conversion, we
convert the speech signals from a source speaker to mimic
the target speaker by modifying the controllable parame-
ters.
The majority of vocoders are designed based on some
form of the source-filter model of speech production, such
as mixed excitation with a spectral envelope, and glottal
vocoders [79]. STRAIGHT, or “Speech Transformation and
Representation using Adaptive Interpolation of weiGHTed
spectrum”, is one of the popular vocoders in speech synthe-
sis and voice conversion [80]. It decomposes a speech signal
into: 1) a smooth spectrogram which is free from periodicity
in time and frequency; 2) a fundamental frequency (F0)
contour which is estimated using a fixed-point algorithm;
and 3) a time-frequency periodicity map which captures
the spectral shape of the noise and its temporal envelope.
STRAIGHT is widely used in voice conversion because its
parametric representation facilitates the statistical modeling
of speech, which allows for easy manipulation of speech [11],
[81], [82].
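As a concrete illustration of such a decomposition, the sketch below uses WORLD (via the pyworld package), a freely available vocoder closely related in spirit to STRAIGHT, to extract F0, a smooth spectral envelope and an aperiodicity map, and to resynthesize speech after a simple manipulation. The file names and the F0 scaling factor are arbitrary assumptions for illustration.

import numpy as np
import soundfile as sf
import pyworld as pw

# Analysis: decompose a mono speech signal into the three STRAIGHT-like streams.
x, fs = sf.read("source.wav")                    # assumes a mono recording
x = np.ascontiguousarray(x, dtype=np.float64)

f0, t = pw.dio(x, fs)                            # raw F0 contour
f0 = pw.stonemask(x, f0, t, fs)                  # F0 refinement
sp = pw.cheaptrick(x, f0, t, fs)                 # smooth spectrogram (spectral envelope)
ap = pw.d4c(x, f0, t, fs)                        # time-frequency aperiodicity map

# Reconstruction after a toy manipulation: shift F0 up by 20%.
y = pw.synthesize(f0 * 1.2, sp, ap, fs)
sf.write("modified.wav", y, fs)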
Parametric vocoders are widely adopted for analysis and
reconstruction of speech in voice conversion studies [8],
[9], [11], [12], [46], [47], [83], [84], and continue to play a
major role today [17], [21], [22]. The traditional parametric
vocoders are designed to approximate the complex me-
chanics of human speech production under certain sim-
plified assumptions. For example, the interaction between
F0 and the formant structure is ignored, and the original phase
structure is discarded [85]. The assumptions of a stationary
process in the short-time window and a time-invariant linear
filter also give rise to “robotic” and “buzzy” voices. Such
problems become more serious in voice conversion as we
modify both F0 and the formant structure of speech among
others at the same time. We believe that vocoding can
be improved by considering the interaction between the
parameters.
3) WaveNet Vocoder: Deep learning offers a solution to
some of the inherent problems of parametric vocoders.
WaveNet [53] is a deep neural network that learns to
generate high-quality time-domain waveforms. As it doesn't
assume any mathematical model, it is a data-driven solu-
tion that requires a large amount of training data.
The joint probability of a waveform X = {x_1, x_2, ..., x_N} can
be factorized as a product of conditional probabilities:

p(X) = \prod_{n=1}^{N} p(x_n | x_1, x_2, ..., x_{n-1})    (4)

A WaveNet is constructed with many residual blocks, each
of which consists of 2 × 1 dilated causal convolutions,
a gated activation function and 1 × 1 convolutions. With
additional auxiliary features h, WaveNet can also model
the conditional distribution p(x|h) [53]. Eq. (4) can then be
written as follows:

p(X|h) = \prod_{n=1}^{N} p(x_n | x_1, x_2, ..., x_{n-1}, h)    (5)
A typical parametric vocoder performs both analysis and
reconstruction of speech. However, most of today’s WaveNet
vocoders only cover the function of speech reconstruction.
They take some intermediate representations of speech as
the input auxiliary features, and generate the speech wave-
form as the output. The WaveNet vocoder [55] remarkably
outperforms the traditional parametric vocoders in terms
of sound quality. Not only can it learn the relationship
between input features and output waveform, but also it
learns the interaction among the input features. It has been
successfully adopted as part of the state-of-the-art speech
synthesis [3], [86]–[89] and voice conversion [54], [55], [57],
[60]–[62], [86], [90]–[97] systems.
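To make the residual-block structure above concrete, here is a minimal PyTorch sketch of one WaveNet-style block with a 2 × 1 dilated causal convolution, a gated activation and 1 × 1 convolutions, conditioned on auxiliary features h. The channel sizes are arbitrary assumptions, and the sketch follows the textual description above rather than any particular released implementation.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One WaveNet-style residual block: dilated causal conv + gated activation."""
    def __init__(self, channels, dilation, cond_channels):
        super().__init__()
        self.dilation = dilation
        # 2x1 dilated causal convolution (filter and gate computed jointly)
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size=2, dilation=dilation)
        # 1x1 convolutions for the conditioning features h and the residual/skip outputs
        self.cond = nn.Conv1d(cond_channels, 2 * channels, kernel_size=1)
        self.res = nn.Conv1d(channels, channels, kernel_size=1)
        self.skip = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x, h):
        # left-pad so the convolution is causal (no future samples are used)
        y = nn.functional.pad(x, (self.dilation, 0))
        y = self.conv(y) + self.cond(h)
        filt, gate = y.chunk(2, dim=1)
        z = torch.tanh(filt) * torch.sigmoid(gate)   # gated activation unit
        return x + self.res(z), self.skip(z)          # residual and skip paths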
There have been promising studies on using vocoding
parameters as the intermediate representations in WaveNet
vocoding. A speaker independent WaveNet vocoder [55] is
studied by utilizing the STRAIGHT vocoding parameters,
such as F0, aperiodicity, and spectrum as the inputs of
WaveNet. In this way, WaveNet learns a sample-by-sample
correspondence between the time-domain waveform and
the input vocoding parameters. When such a WaveNet
vocoder is trained on speech signals from a large speaker
population, we obtain a speaker independent vocoder [55].
By adapting the speaker independent WaveNet vocoder
with speaker specific data, we obtain a speaker dependent
vocoder that generates personalized voice output [58], [60].
The study on WaveNet vocoder also opens up opportu-
nities for the use of other non-vocoding parameters as
the input. For example, a recent study adopts phonetic
posteriorgrams (PPGs) in WaveNet vocoding with promising
results in voice conversion with non-parallel training data
[94]–[97]. Another study adopts latent code of autoencoder
and speaker embedding as the speech representation for
WaveNet vocoder [98].
4) Recent Progress on Neural Vocoders: More recently,
speaker independent WaveRNN-based neural vocoder [63]
became popular as it can generate human-like voices from
both in-domain and out-of-domain spectrograms [99]–[101].
Another well-known neural vocoder that achieves high-
quality synthesis performance is WaveGlow [64]. WaveGlow
is a flow-based network capable of generating high quality
speech from mel-spectrogram [102]. WaveGlow benefits
from the best of Glow and WaveNet so as to provide fast,
efficient and high-quality audio synthesis, without the need
for auto-regression. We note that WaveGlow is implemented
using only a single network with a single cost function, that
is to maximize the likelihood of the training data, which
makes the training procedure simple and stable [103].
WaveNet [53] uses an auto-regressive (AR) approach to
model the distribution of waveform sampling points, that
incurs a high computational cost. As an alternative to auto-
regression, a neural source-filter (NSF) waveform modeling
framework is proposed [104], [105]. We note that NSF is
straightforward to train and fast in waveform generation. It
is reported to be 100 times faster than the WaveNet vocoder,
while achieving comparable voice quality on a large speech
corpus [106].
B. Feature Extraction
With speech analysis, we derive vocoding parameters
that usually contain spectral and prosodic components
to represent the input speech. The vocoding parameters
characterize the speech in a way that we can reconstruct the
speech signal later on after transmission. This is particularly
important in speech communication. However, such vocod-
ing parameters may not be the best for transformation of
voice identity. More often, the vocoding parameters are fur-
ther transformed into speech features, that we call feature
extraction in Figure 1, for more effective modification of the
acoustic properties in voice conversion.
For the spectral component, feature extraction aims
to derive low-dimensional representations from the high-
dimensional raw spectra. Generally speaking, the spectral
features should be able to represent the speaker individuality
well. The features should not only fit the spectral envelope well,
but also be able to be converted back to the spectral envelope.
They should have good interpolation properties that allow
for flexible modification.
The magnitude spectrum can be warped to the Mel or Bark
frequency scale, which is perceptually meaningful for voice
conversion. It can also be transformed into the cepstral domain
with a finite number of coefficients via the Discrete
Cosine Transform of the log-magnitude spectrum. Cepstral coefficients
are less correlated. In this way, the high-dimensional magnitude
spectrum is transformed into a lower-dimensional feature rep-
resentation. The commonly used speech features include
Mel-cepstral coefficients (MCC), linear predictive cepstral
coefficients (LPCC), and line spectral frequencies (LSF).
Typically, a speech frame is represented by a feature vector.
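A minimal sketch of the cepstral transform described above, using a DCT of the log-magnitude spectrum of one frame; it omits the Mel/Bark warping and is not a full MCC extractor, and the number of coefficients is an arbitrary assumption.

import numpy as np
from scipy.fftpack import dct

def cepstral_features(magnitude_frame, n_coeffs=25):
    """Low-dimensional cepstral coefficients from one magnitude-spectrum frame."""
    log_mag = np.log(np.maximum(magnitude_frame, 1e-10))   # avoid log(0)
    c = dct(log_mag, type=2, norm='ortho')                  # decorrelating transform
    return c[:n_coeffs]                                      # keep a finite number of coefficients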
Short-time analysis has been the most practical way
of speech analysis. Unfortunately, it inherently ignores the
temporal context of speech, which is crucial in voice conver-
sion. Many studies have shown that multiple frames [18],
[107], dynamic features [62], and phonetic segments serve
as effective features in feature mapping.
For the prosodic component, feature extraction can be
used to decompose prosodic signal, such as fundamental
frequency (F0), aperiodicity (AP), and energy contours,
into speaker dependent and independent parameters [82].
In this way, we can carry over the speaker independent
prosodic patterns, while converting speaker dependent
ones during the feature mapping.
C. Feature Mapping
In the typical flow of voice conversion, feature mapping
performs the modification of speech features from source
to target speaker. Spectral mapping seeks to change the
voice timbre, while prosody conversion seeks to modify the
prosody features, such as fundamental frequency, intona-
tion and duration. So far, spectral mapping remains the
center of many voice conversion studies.
During training, we learn the mapping function, F(·)
in Eq.(1), from training data. At run time inference, the
mapping function transforms the acoustic features. A large
part of this paper is devoted to the study of the mapping
function. In Section III, we will discuss the traditional
statistical modeling techniques with parallel training data.
In Section IV, we will review the statistical modeling tech-
niques that do not require parallel training data. In Section
V, we will introduce a number of deep learning approaches,
which includes 1) parallel training data of paired speakers;
and 2) beyond parallel data of paired speakers.
III. STATISTICAL MODELING FOR VOICE CONVERSION WITH
PARALLEL TRAINING DATA
Most of the traditional voice conversion techniques as-
sume availability of parallel training data. In other words,
the mapping function is trained on paired utterances of
the same linguistic content spoken by source and target
speaker. Voice conversion studies started with statistical
approaches [108] in the late 1980s, which can be grouped into
parametric and non-parametric mapping techniques. Para-
metric techniques make assumptions about the under-
lying statistical distributions of speech features and their
mapping. Non-parametric ones make fewer assumptions
about the data, but seek to fit the training data with the
best mapping function, while maintaining some ability to
generalize to unseen data.
Parametric techniques, such as the Gaussian mixture model
(GMM) [109], dynamic kernel partial least squares regres-
sion, and the PSOLA mapping technique [73], represent a great
success of the recent past. The vector quantization ap-
proach to voice conversion is a typical non-parametric
technique. It maps codewords between source and target
codebooks [8]. In this method, a source feature vector
is approximated by the nearest codeword in the source
codebook, and mapped to the corresponding codeword
in the target codebook. To reduce the quantization error,
fuzzy vector quantization was studied [9], [110], where
continuous weights for individual clusters are determined
at each frame according to the source feature vector. The
converted feature vector is defined as a weighted sum of
the centroid vectors of the mapping codebook. Recently,
the non-negative matrix factorization approach marks a successful
non-parametric implementation.
We will discuss a typical frame-level mapping paradigm
under the assumption of parallel training data, as illustrated
in Figure 2. During the training phase, given parallel train-
ing data from a source speaker x and a target speaker y,
frame alignment is performed to align the source speech
vectors and target speech vectors to obtain the paired
speech feature vectors z = {x, y}. Dynamic time warping
is a feature-based alignment technique that is commonly
used. A speech recognizer, which is equipped with phonetic
knowledge, can also be used to perform model-based align-
ment. Frame alignment has been well studied in speech
processing. In voice conversion, a large body of literature
has been devoted to the design of frame-level mapping
function.
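A minimal dynamic time warping sketch for the frame alignment step, written from the standard DTW recurrence; X and Y are assumed to be frame-level feature matrices of shape (frames, dimension), and the returned index pairs define the paired vectors z = {x, y}.

import numpy as np

def dtw_align(X, Y):
    """Align two feature sequences and return a list of (source, target) frame pairs."""
    Tx, Ty = len(X), len(Y)
    cost = np.full((Tx + 1, Ty + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            d = np.linalg.norm(X[i - 1] - Y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack the optimal warping path
    path, i, j = [], Tx, Ty
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]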
A. Gaussian Mixture Models
In Gaussian mixture modeling (GMM) approach to voice
conversion [109], we represent the relationship between
two sets of spectral envelopes, from source and target
speakers, using a Gaussian mixture model. The Gaussian
mixture model is a continuous parametric function, that
is trained to model the spectral mapping. In [109], har-
monic plus noise (HNM) features are used in the feature
mapping, which allows for high-quality modifications of
speech signals. The GMM approach is seen as an extension
to the vector quantization approach [8], [9], which results
in improved voice quality. However, the speech quality
is affected by some factors, e.g., spectral movement with
inappropriate dynamic characteristics caused by the frame-
by-frame conversion process, and excessive smoothing of
converted spectra [111]–[113].
To address the frame-by-frame conversion issue, a maxi-
mum likelihood estimation technique was studied to model
the spectral parameter trajectory [11]. This technique aims
to estimate an appropriate spectrum sequence using dy-
namic acoustic features. To address the over-smoothing
issue, or the muffled effect, the joint density Gaussian mixture
model (JD-GMM) was studied [2], [11] to jointly model the
sequences of spectral features and their variances using
maximum likelihood estimation, which increases the global
variance of the spectral features. The JD-GMM method in-
volves two phases: off-line training and run-time conversion
phases. During the training phase, Gaussian mixture model
(GMM) is adopted to model the joint probability density
p(z) of the paired feature vector sequence z={x,y}, which
represents the joint distribution of source speech xand
target speech y:
p(z) = \sum_{k=1}^{K} w_k^{(z)} \, \mathcal{N}(z \,|\, \mu_k^{(z)}, \Sigma_k^{(z)})    (6)

\mu_k^{(z)} = \begin{bmatrix} \mu_k^{(x)} \\ \mu_k^{(y)} \end{bmatrix}, \quad
\Sigma_k^{(z)} = \begin{bmatrix} \Sigma_k^{(xx)} & \Sigma_k^{(xy)} \\ \Sigma_k^{(yx)} & \Sigma_k^{(yy)} \end{bmatrix}

where K is the number of Gaussian components, and \mu_k^{(z)} and
\Sigma_k^{(z)} are the mean vector and the covariance matrix of the
k-th Gaussian component \mathcal{N}(z | \mu_k^{(z)}, \Sigma_k^{(z)}), respectively. To es-
timate the model parameters of the JD-GMM, the expectation-
maximization (EM) algorithm [114]–[117] is used to maxi-
mize the likelihood on the training data.
A post-filter based on modulation spectrum modification
is found useful to address the inherent over-smoothing
issue in statistical modeling [118], such as the GMM approach,
as it effectively compensates for the global variance. The
GMM approach is a parametric solution [119]–[123]. It
represents a successful statistical modeling technique that
works well with parallel training data.
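The sketch below illustrates JD-GMM conversion with the frame-wise minimum mean square error rule: a GMM is fitted on joint vectors z = [x; y] with scikit-learn, and each source frame is converted as a responsibility-weighted sum of component-wise linear regressions. X_src and Y_tgt are assumed to be frame-aligned feature matrices of shape (frames, D); the feature dimension and number of components are arbitrary assumptions, and the trajectory-based ML estimation with dynamic features [11] is omitted.

import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.stats import multivariate_normal

D = 24                                   # feature dimension (e.g. MCC order), assumed
Z = np.hstack([X_src, Y_tgt])            # aligned joint vectors z = [x; y], shape (T, 2D)
gmm = GaussianMixture(n_components=8, covariance_type='full').fit(Z)

def jdgmm_convert(x):
    """Frame-wise MMSE conversion of one source feature vector x of shape (D,)."""
    # responsibilities P(k|x) from the marginal source model
    resp = np.array([w * multivariate_normal.pdf(x, m[:D], C[:D, :D])
                     for w, m, C in zip(gmm.weights_, gmm.means_, gmm.covariances_)])
    resp /= resp.sum()
    y_hat = np.zeros(D)
    for k, (m, C) in enumerate(zip(gmm.means_, gmm.covariances_)):
        mu_x, mu_y = m[:D], m[D:]
        Sxx, Syx = C[:D, :D], C[D:, :D]
        y_hat += resp[k] * (mu_y + Syx @ np.linalg.solve(Sxx, x - mu_x))
    return y_hat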
B. Dynamic Kernel Partial Least Squares
The family of parametric techniques also include linear
[73], [74] or non-linear mapping functions. With the local
mapping functions, each frame of speech is typically trans-
formed independently from the neighboring frames, which
causes temporal discontinuities to the output [74].
To take into account the time-dependency between
speech features, a dynamic kernel partial least squares
(DKPLS) technique was studied [15]. This method is based
on a kernel transformation of the source features to allow
non-linear modeling, and concatenation of adjacent frames to
model the dynamics. The non-linear transformation takes
advantage of the global properties of the data that the GMM
approach doesn't. It was reported that DKPLS outperforms
the GMM approach [109] in terms of voice quality. This method
is simple and efficient, and does not require massive tuning.
More recently, DKPLS-based approaches are studied to
overcome the over-fitting and over-smoothing problems by
feature combination strategy [124].
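A rough sketch loosely following the DKPLS recipe: the source frames are kernelized against a set of reference centres, adjacent kernelized frames are concatenated to model the dynamics, and partial least squares regression maps the result to the target features. X_src, Y_tgt and X_new are assumed to be feature matrices (the first two frame-aligned); the kernel width, centre selection and number of components are arbitrary assumptions.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cross_decomposition import PLSRegression

def stack_context(F, width=1):
    """Concatenate +/- width neighbouring rows (frames); edge frames simply wrap around."""
    return np.hstack([np.roll(F, s, axis=0) for s in range(-width, width + 1)])

centres = X_src[::50]                                   # reference vectors (crude subsampling)
gamma = 0.05                                            # Gaussian kernel width (assumed)

K_train = stack_context(rbf_kernel(X_src, centres, gamma=gamma))
pls = PLSRegression(n_components=32).fit(K_train, Y_tgt)

# Run-time conversion of new source frames
Y_hat = pls.predict(stack_context(rbf_kernel(X_new, centres, gamma=gamma)))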
Fig. 2: Training and run-time inference of voice conversion with parallel training data under the frame-level mapping
paradigm. The pink boxes represent the training algorithms of the models that result in the mapping function F(x) in
the blue box for run-time inference. Dotted box (1) includes examples of statistical approaches, and (2) includes examples
of deep learning approaches.

While statistical modeling for the mapping of spectral
features has been well studied, conversion of prosody is
often achieved by simply shifting and scaling F0, which is
not sufficient for high-quality voice conversion. Hierarchical
modeling of prosody, for different linguistic units at several
distinct temporal scales, represents an advanced technique
for prosody conversion [82], [125]–[127]. DKPLS has cre-
ated a platform for multi-scale prosody conversion through
wavelet transform [128] that shows significant improvement
in naturalness over the F0 shifting and scaling technique.
C. Frequency Warping
Parametric techniques, such as GMM [109] and DKPLS
[15], usually suffer from over-smoothing because they use
the minimum mean square error [81] or the maximum
likelihood [11] function as the optimization criterion. As a
result, the system produces acoustic features that represent
statistical average, and fails to capture the desired details
of temporal and spectral dynamics.
Additionally, parametric techniques generally employ
low-dimensional features, as discussed in Section II.B, such
as the Mel cepstral coefficients (MCC) or line spectral
frequencies (LSF) to avoid the curse of dimensionality. The
low dimensional features, however, are doomed to lose
spectral details because they have low-resolution. Statistical
averaging and low-resolution features both lead to the
muffled effect of output speech [129].
To preserve the necessary spectral details during con-
version, a number of frequency warping-based methods
were introduced. The frequency warping technique directly
transforms the high resolution source spectrum to that of
the target speaker through a frequency warping function. In
recent literature, the warping function is either realized by
a single parameter, such as VTLN-based approaches [26],
[130]–[133], or represented as a piecewise linear function
[73], [129], [134], which has become a mainstream solution.
The goal of piecewise linear warping function is to align a
set of frequencies between the source and target spectrum
by minimizing the spectral distance or maximizing the
correlation between the converted and target spectrum.
More recently, the parametric frequency warping technique
was combined with a non-parametric exemplar-based
technique, which achieves good performance [107].
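A minimal sketch of a piecewise linear frequency warping applied to one high-resolution magnitude-spectrum frame. The anchor frequencies (e.g. matched formant positions) are assumed to be given and to include the end points 0 and the Nyquist frequency.

import numpy as np

def piecewise_warp(spectrum, freqs, src_anchors, tgt_anchors):
    """Warp one magnitude-spectrum frame with a piecewise linear warping function
    defined by matched (source, target) anchor frequencies."""
    # for every target-frequency bin, find the corresponding source frequency
    warp = np.interp(freqs, tgt_anchors, src_anchors)
    # sample the source spectrum at the warped frequencies
    return np.interp(warp, freqs, spectrum)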
D. Non-negative Matrix Factorization
Non-negative matrix factorization (NMF) [135] is an ef-
fective data mining technique that has been widely used,
especially for reconstruction of high quality signals, such
as in speech enhancement [136], [137], speech de-noising
[138], [139], noise and speech estimation [140]. It factorizes
a matrix into two matrices, a dictionary and an activation
matrix, with the property that all three matrices have no
negative elements. The NMF-based techniques are shown
effective in voice conversion with very limited training data.
It marks a major progress of non-parametric approach
to voice conversion since vector quantization technique
was introduced. Successful implementations include non-
negative spectrogram deconvolution [141], locally linear
embedding (LLE) [142], and unit selection [20]. In NMF-
based approaches, a target spectrogram is constructed
as a linear combination of exemplars. Therefore, the over-
smoothing problem can also arise. To overcome the over-
smoothing problem, several effective techniques were de-
veloped, which we summarize next.
1) Sparse Representation: One effective way to alleviate
the over-smoothing problem is to apply sparsity constraint
to the activation matrix, referred to as exemplar-based
sparse representation.
As illustrated in Figure 3, a pair of dictionaries A and B
are first constructed from speech feature vectors, that we
call aligned exemplars, from the source and target. [A; B] is also
called the coupled dictionary.

Fig. 3: Illustration of non-negative matrix factorization for
exemplar-based sparse representation.

At run-time, let's consider a speech utterance as a sequence
of speech feature vectors that form a spectrogram matrix.
The matrix of a source utterance X can be represented as

X \approx A \hat{H}    (7)

Due to the non-negative nature of the spectrogram, the NMF tech-
nique is employed to estimate the source activation matrix
\hat{H}, which is constrained to be sparse. Mathematically, we
estimate \hat{H} by minimizing an objective function,

\hat{H} = \arg\min_{H \ge 0} d(X, AH) + \lambda ||H||    (8)

where \lambda is the sparsity penalty factor. To estimate the activation
matrix \hat{H}, a generalised Kullback-Leibler (KL) divergence is
used. It is assumed that the source and target dictionaries A
and B can share the same source activation matrix \hat{H}.
Therefore, the converted spectrogram for the target
speaker can be written as

\hat{Y} = B \hat{H}    (9)

where the activation matrix \hat{H} serves as the pivot to transfer
the source utterance X to the target utterance Y.
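The activation estimate in Eq. (8) can be sketched with the standard multiplicative update rule for the generalised KL divergence with an additive sparsity penalty, followed by the copy step of Eq. (9). This is an illustrative implementation written from the equations above; the number of iterations and the penalty weight are arbitrary assumptions.

import numpy as np

def estimate_activations(X, A, n_iter=100, lam=0.1, eps=1e-12):
    """Estimate a sparse activation matrix H such that X ~= A H, using multiplicative
    KL-divergence updates with an additive sparsity penalty lam (Eq. (8))."""
    H = np.random.rand(A.shape[1], X.shape[1])
    for _ in range(n_iter):
        AH = A @ H + eps
        H *= (A.T @ (X / AH)) / (A.T @ np.ones_like(X) + lam + eps)
    return H

# Conversion: copy the activations to the target dictionary B (Eq. (9))
# Y_hat = B @ estimate_activations(X, A)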
The sparse representation framework continues to attract
much attention in voice conversion. The recent studies
include its extension to discriminative graph-embedded
NMF approach [19], phonetic sparse representation for
spectrum conversion [22], and its application to timbre and
prosody conversion [143], [144].
2) Phonetic Sparse Representation: As the frame-level
mapping is done at the acoustic feature level, the coupled
dictionary [A; B] is therefore called the acoustic dictionary.
With the transcripts of the training data and a general-purpose
speech recognizer, we are able to obtain phonetic labels
and their boundaries. Studies have shown that the strat-
egy of dictionary construction plays an important role in
voice conversion [145]. The idea of selecting a sub-dictionary
according to the run-time speech content shows improved
performance [21].
Phonetic sparse representation [22] is an extension to
sparse representation for voice conversion. It is built on
the idea of phonetic sub-dictionaries, and dictionary selec-
tion at run-time. The study shows that multiple phonetic
sub-dictionaries consistently outperform a single dictionary
in exemplar-based sparse representation voice conversion
[21], [22]. However, the phonetic sparse representation relies
on a speech recognizer at run-time to help select the sub-
dictionary.
3) Group Sparse Representation: Sisman et al. [62] pro-
posed group sparse representation to formulate both
exemplar-based sparse representation [141], and phonetic
sparse representation [22] under a unified mathematical
framework. With the group sparsity regularization, only
the phonetic sub-dictionary that is relevant to the input
features is likely to be activated at run-time inference. Un-
like phonetic sparse representation that relies on a speech
recognizer for both training and run-time inference, group
sparse representation only requires the speech recognizer
during training when we build the phonetic dictionary. It
was reported that group sparse representation provides sim-
ilar performance to that of phonetic sparse representation
when performing both spectrum and prosody conversion
[62].
IV. STATISTICAL MODELING FOR VOICE CONVERSION WITH
NON-PARALLEL TRAINING DATA
It is easy to understand that it is more straightforward
to train a mapping function from parallel than non-parallel
training data. However, parallel training data are not always
available. In real-world applications, there are situations
where only non-parallel data are available. Intuitively, if we
can derive the equivalents of speech frames or segments
between speakers from non-parallel data, we are able to
establish or to refine the mapping function using the con-
ventional linear transformation parameter training, such as
GMM, DKPLS or frequency warping.
There were a number of attempts to do so. For example,
one idea is to find a source-target mapping between unsu-
pervised feature clusters [146]. Another is to use a speech
recognizer to index the target training data so that we can
retrieve similar frames from the target database for an unknown
source frame at run-time [147]. Unfortunately, each of the
steps may produce errors that accumulate and may lead to
a poor parameter estimation [146]. There was also a study
to use a hidden Markov model (HMM) that is trained for the
target speaker, then the parameters of GMM-based linear
transformation function are estimated in such a way that
the converted source vectors exhibit maximum likelihood
with respect to the target HMM [148]. This method shows
performance comparable to that of parallel-data methods.
However, it requires that the orthography of the training
utterances be known, which limits its use.
Next, we will discuss three clusters of studies and their
representative work: 1) the INCA algorithm, 2) the unit selection
algorithm, and 3) the speaker modeling algorithm.
A. INCA Algorithm
INCA refers to an Iterative combination of a Nearest
Neighbor search step and a Conversion step Alignment
method [27]. It learns a mapping function by finding
the nearest neighbor of each source vector in the target
acoustic space.

Fig. 4: The training of a frame-level mapping function is an
iterative process between the nearest neighbor search step
(INCA alignment) and the conversion step (a parametric
mapping function).

It is based on a hypothesis that an iter-
ative refinement of the basic nearest neighbour method,
in tandem with the voice conversion system, would lead
to a progressive alignment improvement. The main idea is
that the intermediate voice, x_s^k, obtained after the previous
nearest neighbour alignment can be used as the source
voice during the next iteration:

x_s^{k+1} = F_k(x_s^k)    (10)

During training, the optimization process is repeated until
the current intermediate voice, x_s^k, is close enough to the
target voice, y_t. INCA represents a successful framework for
the non-parallel training data problem, where the nearest
neighbor search step (INCA alignment) and the conversion
step (a parametric mapping function) iterates to optimize
the mapping function, as illustrated in Figure 4.
INCA was first implemented with GMM approach [109]
for voice conversion to estimate a linear mapping func-
tion. As INCA does not require any phonetic or linguistic
information, it not only works for non-parallel training
data, but also works for cross-lingual voice conversion.
Experiments show that the INCA implementation of a cross-
lingual system achieves similar performance to its intra-
lingual counterpart that is trained on parallel data [27].
INCA was further implemented with DKPLS approach
[15] that was discussed in Section III.B for parallel training
data. The idea [30] is to use the INCA alignment algorithm
[27] to find the corresponding frames from the source and
target datasets, that allows the DKPLS regression to find a
non-linear mapping between the aligned datasets. It was re-
ported [30] that the INCA-DKPLS implementation produces
high-quality voice that is comparable to the implementation
with parallel training data, given the same amount of training
data.
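A compact sketch of the INCA loop: nearest-neighbour alignment of the current intermediate voice against the target frames, followed by re-estimation of the mapping on the aligned pairs. For brevity, a ridge regression stands in for the GMM or DKPLS mapping used in the literature, and the single-direction search is a simplification; X_src and Y_tgt are assumed to be unaligned source and target feature matrices.

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import Ridge

def inca(X_src, Y_tgt, n_iter=10):
    """Iterate nearest-neighbour alignment and mapping re-estimation (Eq. (10))."""
    x_k = X_src.copy()                       # intermediate voice, starts at the source
    for _ in range(n_iter):
        # 1) align: find the nearest target frame for every intermediate frame
        nn = NearestNeighbors(n_neighbors=1).fit(Y_tgt)
        idx = nn.kneighbors(x_k, return_distance=False)[:, 0]
        # 2) convert: re-estimate the mapping on the aligned pairs and update x_k
        F = Ridge(alpha=1.0).fit(X_src, Y_tgt[idx])
        x_k = F.predict(X_src)
    return F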
Fig. 5: Run-time inference of the unit selection algorithm, which
doesn't model a mapping function with parameters, but
rather searches for the output feature sequence directly from
the target speaker database, and optimizes the output at the
utterance level.
B. Unit Selection Algorithm
The unit selection algorithm has been widely used to generate
natural-sounding speech in speech synthesis. It is known
to produce high speaker similarity and voice quality [75],
[149], [150] because the synthesized waveform is formed
of sound units directly from the target speaker [151]. The
unit selection algorithm optimizes the unit selection from
a voice inventory of a target speaker. It was suggested
[152] to make use of a unit selection synthesis system to
generate parallel versions of the training sentences from
non-parallel data. With the resulting pseudo-parallel data,
the statistical modeling techniques for parallel training data,
which we discuss in Section III, can be readily applied. While
this approach produces satisfactory voice quality [152], it
requires a large speech database to develop the voice
inventory, which is not always practical in reality.
Another idea is to follow what we do in unit selection
speech synthesis by defining a speech feature vector as a
unit [24]. Given an utterance of M speech feature vectors
X = {x_1, x_2, ..., x_M} from the source speaker, dynamic pro-
gramming is applied to find the sequence of feature vectors
y_i from the target speaker that minimizes a cost function,

Y = \arg\min_{y} \Big( \alpha \sum_{i=1}^{M} d_1(x_i, y_i) + (1 - \alpha) \sum_{i=2}^{M} d_2(y_i, y_{i-1}) \Big)    (11)

where d_1(·) represents the acoustic distance between a
source and a target feature vector, while d_2(·) is the con-
catenative cost between two target feature vectors. With
the acoustic distance, we make sure that the retrieved
speech features from the target speakers are close to those
of the source; with the concatenative cost, we encourage
the consecutive speech frames from the target speaker
database to be retrieved together in a multi-frame segment.
As illustrated in Figure 5, the unit selection algorithm is a non-
parametric solution because we don't model the conver-
sion with parameters. It optimizes the output by applying
dynamic programming to find the best feature vector
sequence from the target speaker database. The mapping
function Y = F(X) is defined by the cost function in Eq. (11) itself,
and optimized at the utterance level.
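The sketch below implements the dynamic programming search of Eq. (11) directly over a target database of frame-level feature vectors. It materializes the full target and concatenation cost matrices, which is only practical for small databases; the weight alpha is an arbitrary assumption.

import numpy as np

def unit_selection(X, Y_db, alpha=0.5):
    """Find the target-unit sequence minimising the cost of Eq. (11) for source X (M, D)
    over a target database Y_db (N, D)."""
    M, N = len(X), len(Y_db)
    d1 = np.linalg.norm(X[:, None, :] - Y_db[None, :, :], axis=-1)      # target cost (M, N)
    d2 = np.linalg.norm(Y_db[:, None, :] - Y_db[None, :, :], axis=-1)   # concatenation cost (N, N)
    cost = alpha * d1[0]
    back = np.zeros((M, N), dtype=int)
    for i in range(1, M):
        total = cost[None, :] + (1 - alpha) * d2.T       # total[j, k]: end at k, move to j
        back[i] = np.argmin(total, axis=1)               # best predecessor for each unit j
        cost = alpha * d1[i] + total[np.arange(N), back[i]]
    # backtrack the optimal unit sequence
    path = [int(np.argmin(cost))]
    for i in range(M - 1, 0, -1):
        path.append(back[i][path[-1]])
    return Y_db[np.array(path[::-1])]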
C. Speaker Modeling Algorithm
The techniques for text-independent speaker character-
ization are readily available for non-parallel training data,
where a speaker can be modeled by a set of parameters,
such as a GMM or an i-vector. It is possible to make use of
such speaker models to perform voice conversion.
Mouchtaris et al. [153] used a GMM-based technique to
model the relationship between reference speakers in advance
and apply the relationship to a new speaker. Toda et
al. [154] proposed an eigenvoice approach that performs
two mappings, one to map from the source speaker to
an eigenvoice (or average voice) trained from reference
speakers, and another from the eigenvoice to the target
speaker. While these approaches don't require parallel training
data, they do require parallel data from some reference
speakers.
In speaker verification, the joint factor analysis method
[155] decomposes a supervector into speaker independent,
speaker dependent and channel dependent components,
each of which is represented by a low-dimensional set of
factors. This aims to disentangle speaker from other speech
content for effective speaker verification. Inspired by this
idea, we argue [156] that similar decomposition would be
useful in voice conversion, where we would like to separate
speaker information from the linguistic content, and apply
factor analysis on the speaker specific component.
With factor analysis, the speaker specific component
can be represented by a low-dimensional set of latent
variables via the factor loadings. One of the ideas [156] is
to estimate the phonetic component and factor loadings
from non-parallel prior data. In this way, during the training
process, we only estimate a low-dimensional set of speaker
identity factors and a tied covariance matrix instead of
a full conversion function from the source-target parallel
utterances. Even though parallel utterances are still required
for estimating the conversion function, the use of prior
data allows us to obtain a reliable model from much fewer
training samples than those required by conventional JD-
GMM [157].
Another idea is to perform the voice conversion in
i-vector [155] speaker space, where i-vector is used to
disentangle a speaker from the linguistic content. The
primary motivation is that an i-vector can be extracted in
an unsupervised manner regardless of speaker or speech
content, which opens up new possibilities especially for
non-parallel data scenarios where source and target speech
is of different content or even in different languages [28],
[45], [158]. Kinnunen et al. [159] studies a way to shift the
acoustic features of input speech towards target speech in
the i-vector space. The idea is to learn a function that maps
the i-vector of the source utterance to that of the target.
With the mapping function, we are able to convert the
source speech frame-by-frame to the target. This technique
is free of any parallel data, and text transcription.
V. DEEP LEARNING FOR VOICE CONVERSION
Voice conversion is typically a research problem with
scarce training data. Deep learning techniques are typi-
cally data-driven and rely on big data. However, this is
actually the strength of deep learning in voice conver-
sion. Deep learning opens up many possibilities to benefit
from abundantly available training data, so that the voice
conversion task can focus more on learning the mapping
of speaker characteristics. For example, it shouldnt be
the job of voice conversion task to infer low level detail
during speech reconstruction, a neural vocoder can learn
from large database to do so [98]. It shouldn’t be a task
of voice conversion to learn how to represent an entire
phonetic system of a spoken language, a general purpose
acoustic model of neural ASR [160] or TTS [161] system
can learn from a large database to do so. By leveraging
the large database, we free up the conversion network
from using its capacity to represent low level detail and
general information, but instead, to focus on the high level
semantics necessary for speaker identity conversion.
Deep learning techniques also transform the way we im-
plement the analysis-mapping-reconstruction pipeline. For
effective mapping, we need to derive an adequate intermediate
representation of speech, as discussed in Section II.
The concept of embedding in deep learning provides a
new way of deriving the intermediate representation, for
example, latent code for linguistic content, and speaker
embedding for speaker identity. It also makes the disen-
tanglement of speaker from content much easier.
In this section, we will summarize how deep learning
helps address existing research problems, such as parallel
and non-parallel data voice conversion. We will also review
how deep learning breaks new ground in voice conversion
research.
A. Deep Learning for Frame-Aligned Parallel Data
The study on deep learning approaches for voice con-
version started with parallel training data, where we use
a neural network as an improved regression function to
approximate the mapping function y = F(x) under the
frame-level mapping paradigm in Figure 2.
1) DNN Mapping Function: The early studies on DNN-
based voice conversion methods are focused on spectral
transformation. The DNN mapping function, y = F(x), has some
clear advantages over other statistical models, such as GMM
and DKPLS. For instance, it allows for non-linear mapping
between source and target features, and there is little
restriction on the dimension of the features to be modeled. We
note that conversion on other acoustic features, such as
fundamental frequency and energy contour, can also be
done similarly [162].
Desai et al. [81] proposed a DNN to map a low-
dimensional spectral representation, such as mel-cepstral
coefficients (MCEP), from source to target speaker.
Nakashika et al. [163] proposed to use Deep Belief Nets
(DBNs) to extract latent features from source and target
cepstrum coefficients, and use a neural network with one
hidden layer to perform conversion between latent features.
Mohammadi et al. [164] furthered the idea by studying
a deep autoencoder from multiple speakers to derive a
compact representation of speech spectral features. High-
dimensional representation of spectrum has also been used
in a more recent work [165] for spectral mapping, together
with dynamic features and a parameter generation algo-
rithm [166]. Chen et al. [167] proposed to model the distri-
butions of spectral envelopes of source and target speakers
respectively through a layer-wise generative training.
Generally speaking, DNN for spectrum and/or prosody
transformation requires a large amount of parallel training
data from paired speakers, which is not always feasible. But
it opens up opportunities for us to make use of speech data
from multiple speakers beyond source and target, to better
model the source and the target speakers, and to discover
better feature representations for feature mapping.
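A minimal PyTorch sketch of a frame-level DNN mapping function y = F(x) trained on frame-aligned source/target feature vectors. The feature dimension, layer sizes, learning rate and loss are illustrative assumptions, not the settings used in the cited studies.

import torch
import torch.nn as nn

class MappingDNN(nn.Module):
    """Feed-forward mapping between frame-aligned source and target feature vectors."""
    def __init__(self, dim=24, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

model = MappingDNN()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(x_batch, y_batch):
    """x_batch, y_batch: frame-aligned source/target feature tensors of shape (B, 24)."""
    optimiser.zero_grad()
    loss = loss_fn(model(x_batch), y_batch)
    loss.backward()
    optimiser.step()
    return loss.item()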
2) LSTM Mapping Function: To model the temporal correlation across speech frames in voice conversion, Nakashika et al. [168] explored the use of Recurrent Temporal Restricted Boltzmann Machines (RTRBM), a type of recurrent neural network. The success of Long Short-Term Memory (LSTM) [169], [170] in sequence-to-sequence modeling inspired the study of LSTM in voice conversion, which leads to an improvement in the naturalness and continuity of the speech output.
The LSTM network architecture consists of a set of memory blocks and peephole connections that support the storage of, and access to, long-range contextual information [171] in linear memory cells. It learns the optimal amount of contextual information for voice conversion. A bidirectional LSTM (BLSTM) network is expected to capture sequential information and maintain long-range contextual features from both the forward and backward sequences [45].
Sun et al. [40] and Ming et al. [172] proposed a deep bidirectional LSTM network (DBLSTM) that stacks multiple hidden layers of the BLSTM architecture, which is shown to outperform DNN voice conversion even without using dynamic features. While the DBLSTM-based voice conversion approach generates high-quality synthesized voice, it typically requires a large speech corpus from the source and target speakers for training, which limits the scope of its applications in practice [40].
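A minimal sketch of such a BLSTM mapping network is given below, again assuming PyTorch and illustrative layer sizes rather than the exact configurations of [40], [172]; the network maps a sequence of source frames to a sequence of target frames of the same length.

```python
# A minimal sketch of a (deep) bidirectional LSTM mapping network that
# converts a sequence of source MCEP frames to target MCEP frames.
# Layer sizes are illustrative assumptions, not from the cited papers.
import torch
import torch.nn as nn

class BLSTMMapper(nn.Module):
    def __init__(self, feat_dim=24, hidden_dim=256, num_layers=2):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, feat_dim)

    def forward(self, x):          # x: (batch, frames, feat_dim)
        h, _ = self.blstm(x)       # h: (batch, frames, 2 * hidden_dim)
        return self.proj(h)        # frame-by-frame target prediction

# The network is trained with an MSE loss on frame-aligned parallel data,
# just like the DNN mapper, but it can exploit context across frames.
model = BLSTMMapper()
x = torch.randn(4, 200, 24)        # 4 utterances of 200 aligned frames
y_hat = model(x)                   # output has the same length as the input
```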
Just like the GMM approach, the DNN and LSTM techniques rely on an external frame aligner during training data preparation, as illustrated in Figure 2. At run-time, the conversion process follows the typical 3-step pipeline and does not change the speech duration during the conversion.
B. Encoder-decoder with Attention for Parallel Data
The research problems of voice conversion are centered
around alignment and mapping, which are interrelated both
during training and at run-time inference, as illustrated in
Figure 2. During training, more accurate alignment helps build a better mapping function, which explains why we prefer parallel training data. At run-time inference, the frame-level mapping paradigm does not change the duration of the speech during the conversion. While it is possible to model and predict the duration of the voice conversion output, it is not straightforward to incorporate a duration model and a mapping model in a systematic manner. Deep learning provides a new solution to this research problem.

Fig. 6: Encoder-decoder mechanism with attention for voice conversion.
The attention mechanism [173], [174] in encoder-decoder neural networks brings about a paradigm change. The idea of attention was first successfully used in machine translation [173], speech recognition [175], and sequence-to-sequence speech synthesis [86], [176]–[178], which led to many parallel studies in voice conversion [179]–[181]. With the attention mechanism, the neural network learns the feature mapping and the alignment at the same time during training. At run-time inference, the network automatically decides the output duration according to what it has learnt. In other words, the frame aligner in Figure 2 is no longer required.
There are several variations based on recurrent neural networks, such as SCENT [179] and AttS2S-VC [181]. They follow the widely-used architecture of encoder-decoder with attention [180], [182]. Suppose that we have a source speech sequence $x = \{x_1, x_2, \ldots, x_{T_s}\}$. The encoder network first transforms the input feature sequence into hidden representations $h = \{h_1, h_2, \ldots, h_{T_h}\}$ at a lower frame rate, with $T_h < T_s$, which are suitable for the decoder to deal with. At each decoder time step, the attention module aggregates the encoder outputs by attention probabilities and produces a context vector. Then, the decoder predicts the output acoustic features frame by frame using the context vectors. Furthermore, a post-filtering network is designed to enhance the accuracy of the converted acoustic features to generate the converted speech $y = \{y_1, y_2, \ldots, y_{T_y}\}$. During training, the attention mechanism learns the mapping dynamics between the source sequence and the target sequence. At run-time inference, the decoder and the attention mechanism interact to perform the mapping and the alignment at the same time. The overall architecture is illustrated in Figure 6.
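The sketch below illustrates this decoding loop with a simple dot-product attention, assuming PyTorch; it is a minimal illustration rather than a faithful re-implementation of SCENT or AttS2S-VC, which additionally use, e.g., more elaborate attention, stop-token prediction and post-filtering.

```python
# A compact sketch of the encoder-decoder with attention in Figure 6.
# Dimensions and the simple dot-product attention are illustrative
# assumptions; the cited systems are considerably more elaborate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2SeqVC(nn.Module):
    def __init__(self, feat_dim=80, hidden_dim=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRUCell(feat_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, feat_dim)

    def forward(self, src, max_len):
        enc, _ = self.encoder(src)                 # (B, Ts, H)
        B, _, H = enc.shape
        state = enc.new_zeros(B, H)
        frame = src.new_zeros(B, src.size(-1))     # initial "go" frame
        outputs = []
        # For simplicity the output length is given; real systems usually
        # predict a stop token to decide when to terminate decoding.
        for _ in range(max_len):
            # dot-product attention over all encoder outputs
            scores = torch.bmm(enc, state.unsqueeze(-1)).squeeze(-1)   # (B, Ts)
            weights = F.softmax(scores, dim=-1)
            context = torch.bmm(weights.unsqueeze(1), enc).squeeze(1)  # (B, H)
            state = self.decoder(torch.cat([frame, context], dim=-1), state)
            frame = self.out(state)                # predicted acoustic frame
            outputs.append(frame)
        return torch.stack(outputs, dim=1)         # (B, Ty, feat_dim)

model = Seq2SeqVC()
src = torch.randn(2, 120, 80)                      # two source utterances
converted = model(src, max_len=100)                # Ty need not equal Ts
```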
While recurrent neural networks represent an effective implementation of sequence-to-sequence conversion, recent studies have shown that convolutional neural networks with gating mechanisms also learn long-term dependencies well [53], [183]. Such a model employs an attention mechanism that effectively makes parallel computation possible for encoding and decoding. During decoding, the causal convolution design allows the model to generate an output sequence in an autoregressive manner. Kameoka et al. proposed a convolutional neural network implementation for voice conversion [184], called ConvS2S-VC. Recent studies show that ConvS2S-VC outperforms its recurrent
neural network counterparts in both pairwise and many-to-many voice conversion [181].

Fig. 7: Training a CycleGAN with a cycle-consistency loss based on the L1 norm for voice conversion with non-parallel training data of paired speakers. The L1 norm represents the least absolute errors.
The encoder-decoder structure with attention marks a departure from the frame-level mapping paradigm. The attention does not perform the mapping frame-by-frame, but rather allows the decoder to attend to multiple speech frames and use a soft combination of them to predict an output frame in the decoding process. With the attention mechanism, the duration of the converted speech $T_y$ is typically different from that of the source speech $T_s$, reflecting the differences in speaking style between source and target. This represents a way to handle both spectral and prosody conversion at the same time. Studies have attributed the improvement of voice quality to the effective attention mechanism. The attention mechanism also represents the first step towards relaxing the rigid requirement of parallel data in voice conversion.
C. Beyond Parallel Data of Paired Speakers
In Sections III and IV, we studied statistical modeling for voice conversion with parallel and with non-parallel training data. The advent of deep learning has broken new ground for voice conversion research, and we now go beyond the paradigm of parallel and non-parallel training data. By non-parallel training data, we refer to the case where non-parallel utterances from the source and target speakers are still required. However, recent studies show that deep learning has enabled many voice conversion scenarios without the need for parallel data. In this section, we summarize the studies into four scenarios,
1) Non-parallel data of paired speakers,
2) Leveraging TTS systems,
3) Leveraging ASR systems, and
4) Disentangling speaker from linguistic content.
1) Non-parallel data of paired speakers: Voice conversion with non-parallel training data is a task similar to image-to-image translation, which is to find a mapping from a source domain to a target domain without the need for parallel training data. Let us draw a parallel between image-to-image translation and voice conversion. In image translation, we would like to translate a horse into a zebra, where we preserve the structure of the horse and change its coat to that of a zebra [185]–[190]; in voice conversion, we would like to transform one voice into that of another, while preserving the linguistic and prosodic content.
CycleGAN is based on the concept of adversarial learning [191], which is to train a generative model to find a solution in a min-max game between two neural networks, called the generator (G) and the discriminator (D). It is known to achieve remarkable results [185] on several tasks where paired training data do not exist, such as image manipulation and synthesis [185], [188], [192]–[195], speech enhancement [196], speech recognition [197], and speech synthesis [198], [199].
As the speech data are non-parallel, alignment is not easily achieved. Kaneko and Kameoka first studied a CycleGAN [47], [48], [200], [201] that incorporates three loss functions, the adversarial loss, the cycle-consistency loss, and the identity-mapping loss, to learn forward and inverse mappings between the source and target speakers.
The adversarial loss measures how distinguishable the distribution of the converted features is from that of the real features, i.e., the source features $x$ or the target features $y$. For the forward mapping, it is defined as follows:

$\mathcal{L}_{ADV}(G_{X \to Y}, D_Y, X, Y) = \mathbb{E}_{y \sim P(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim P(x)}[\log(1 - D_Y(G_{X \to Y}(x)))]$ (12)

The closer the distribution of the converted data is to that of the target data, the smaller this loss becomes.
The adversarial loss only tells us whether $G_{X \to Y}$ follows the distribution of the target data; it does not ensure that the contextual information, i.e., the general sentence structure that we would like to carry over from source to target, is preserved. To ensure that we maintain consistent contextual information between $x$ and $G_{X \to Y}(x)$, the cycle-consistency loss, presented in Figure 7, is introduced,
$\mathcal{L}_{CYC}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x \sim P(x)}[\|G_{Y \to X}(G_{X \to Y}(x)) - x\|_1] + \mathbb{E}_{y \sim P(y)}[\|G_{X \to Y}(G_{Y \to X}(y)) - y\|_1]$ (13)

where $\|\cdot\|_1$ denotes the L1 norm, or least absolute errors, which is known to produce sharper spectral features. This loss encourages $G_{X \to Y}$ and $G_{Y \to X}$ to find an optimal pseudo pair of $(x, y)$ through circular conversion.
To encourage the generator to find a mapping that preserves the underlying linguistic content between the input and output [202], an identity-mapping loss is introduced as follows,

$\mathcal{L}_{ID}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x \sim P(x)}[\|G_{Y \to X}(x) - x\|] + \mathbb{E}_{y \sim P(y)}[\|G_{X \to Y}(y) - y\|]$ (14)
Combining the three loss functions, we have the total loss,

$\mathcal{L}(G, F, D_X, D_Y, X, Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, X, Y) + \lambda_{CYC}\,\mathcal{L}_{CYC}(G, F, X, Y) + \lambda_{ID}\,\mathcal{L}_{ID}(G, F, X, Y)$ (15)

where $G = G_{X \to Y}$ and $F = G_{Y \to X}$ denote the forward and inverse generators, and $\lambda_{CYC}$ and $\lambda_{ID}$ are trade-off parameters.
The optimal mapping functions $G^*$ and $F^*$ are obtained by solving the min-max game defined as:

$G^*, F^* = \arg\min_{G,F} \max_{D_X, D_Y} \mathcal{L}(G, F, D_X, D_Y, X, Y)$ (16)
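As an illustration of how Eqs. (12)-(15) translate into a training objective, the sketch below computes the generator and discriminator losses in PyTorch; the binary cross-entropy form of the adversarial loss, the placeholder generators and discriminators, and the trade-off weights are assumptions (practical CycleGAN-VC implementations often use a least-squares adversarial loss instead).

```python
# A minimal sketch of the CycleGAN-VC training objective in Eqs. (12)-(15).
# G_xy, G_yx, D_x, D_y stand for the two generators and discriminators; any
# nn.Module with matching input/output shapes could be plugged in.
import torch
import torch.nn.functional as F

def generator_losses(G_xy, G_yx, D_x, D_y, x, y, lambda_cyc=10.0, lambda_id=5.0):
    fake_y, fake_x = G_xy(x), G_yx(y)
    # adversarial terms, Eq. (12): generators try to make the discriminators
    # label the converted features as real
    dy_fake, dx_fake = D_y(fake_y), D_x(fake_x)
    adv = F.binary_cross_entropy_with_logits(dy_fake, torch.ones_like(dy_fake)) \
        + F.binary_cross_entropy_with_logits(dx_fake, torch.ones_like(dx_fake))
    # cycle-consistency loss, Eq. (13): x -> y -> x and y -> x -> y
    cyc = F.l1_loss(G_yx(fake_y), x) + F.l1_loss(G_xy(fake_x), y)
    # identity-mapping loss, Eq. (14)
    idt = F.l1_loss(G_yx(x), x) + F.l1_loss(G_xy(y), y)
    # total generator objective, Eq. (15)
    return adv + lambda_cyc * cyc + lambda_id * idt

def discriminator_loss(D, real, fake):
    # the discriminator learns to separate real from converted features
    real_logits, fake_logits = D(real), D(fake.detach())
    return F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
         + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))

# Toy instantiation with linear generators/discriminators over 24-dim MCEPs:
G_xy, G_yx = torch.nn.Linear(24, 24), torch.nn.Linear(24, 24)
D_x, D_y = torch.nn.Linear(24, 1), torch.nn.Linear(24, 1)
x, y = torch.randn(8, 24), torch.randn(8, 24)
loss_g = generator_losses(G_xy, G_yx, D_x, D_y, x, y)
loss_d = discriminator_loss(D_y, y, G_xy(x))
```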
CycleGAN represents a successful deep learning implementation that finds an optimal pseudo pair from non-parallel data of paired speakers. It does not require any frame alignment mechanism such as dynamic time warping or attention. Experimental results show that, with non-parallel training data, CycleGAN achieves performance comparable to that of a GMM-based system trained on twice the amount of parallel data [47]. Moreover, with adversarial training, it effectively overcomes the over-smoothing problem, which is known to be one of the main factors leading to speech-quality degradation. We note that, more recently, CycleGAN-VC2, an improved version of CycleGAN-VC, has been studied [201]; it further improves CycleGAN by incorporating three new techniques: an improved objective (two-step adversarial losses), an improved generator (2-1-2D CNN), and an improved discriminator (PatchGAN). CycleGAN has been successfully applied to mono-lingual [48], [203], cross-lingual voice conversion [204], emotional voice conversion [205], [206], and rhythm-flexible voice conversion [207].
Unlike the encoder-decoder structure, CycleGAN follows a generative modeling architecture that does not explicitly model internal representations, such as voice identity, speech duration, and emotion, to support flexible manipulation. Therefore, it is more suitable for voice conversion between a specific source and target pair. Nonetheless, it represents an important milestone towards non-parallel data voice conversion.
2) Leveraging TTS systems: We have discussed deep learning architectures for voice conversion that do not involve text. One of the important aspects of voice conversion is to carry forward the linguistic content from source to target. Voice conversion and TTS systems are similar in the sense that they both aim to generate high-quality speech with the appropriate linguistic content. A TTS system provides a mechanism for the speech to adhere to the linguistic content. The ideas for leveraging the TTS mechanism can be motivated in different ways: firstly, a TTS system is trained on a large speech database, which offers a high-quality speech reconstruction mechanism given the linguistic content;
secondly, a TTS system is equipped with a high-quality attention mechanism that is needed by voice conversion.

Fig. 8: The upper panel shows a TTS flow, and the lower panel shows a voice conversion flow. Both follow a similar encoder-decoder with attention architecture. Voice conversion leverages the TTS system that is linguistically informed.
As illustrated in Figure 8, encoder-decoder models with attention have recently shown considerable success in modeling a variety of complex sequence-to-sequence problems. Tacotron [87], [176], [208] represents one of the successful text-to-speech (TTS) implementations, and it has been extended to voice conversion [3], [179].
Zhang et al. proposed a joint training system architecture for both text-to-speech and voice conversion [3] by extending the model architecture of Tacotron; it features a multi-source sequence-to-sequence model with dual inputs and a dual attention mechanism. By taking only text as input, the system performs speech synthesis. The system can also take either voice alone, or both text and voice, as input for voice conversion. The multi-source encoder-decoder model is trained with a decoder that is linguistically informed via the joint TTS training, as illustrated by the shared decoder in Figure 8. Experiments show that the joint training improves the voice conversion task with or without text input at run-time inference.
Park et al. proposed a voice conversion system, known as Cotatron, that is built on top of a multi-speaker Tacotron TTS architecture [161]. At run-time inference, the pre-trained TTS system is used to derive speaker-independent linguistic features of the source speech. This process is guided by the transcription of the input speech; as such, the text transcription of the source speech is required at run-time inference. The system uses the TTS encoder to extract speaker-independent linguistic features, in other words, to disentangle the speaker identity. The decoder then takes the attention-aligned speaker-independent linguistic features as the input, and the target speaker identity as the condition, to generate the target speaker's voice. In this way, voice conversion leverages the attention mechanism, or shared attention, from TTS, as shown in Figure 8. Cotatron is designed to perform one-to-many voice conversion. A study [209] that shares a similar motivation with [161], but is based on
the Transformer instead of Tacotron, suggests transferring knowledge from a learned TTS model to benefit from large-scale, easily accessible TTS corpora.

Fig. 9: Training phase of the average modeling approach that maps PPG features to MCEP features for voice conversion [44].
Zhang et al. [210] proposed to improve the sequence-to-sequence model [179] by using text supervision during training. A multi-task learning structure is designed, which adds auxiliary classifiers to the middle layers of the sequence-to-sequence model to predict linguistic labels as a secondary task. The linguistic labels can be obtained either manually or automatically with alignment tools. With the linguistic label objective, the encoder and decoder are expected to generate meaningful intermediate representations that are linguistically informed. The text transcripts are only required during training. Experiments show that multi-task learning with linguistic labels effectively improves the alignment quality of the model, thus alleviating issues such as mispronunciation.
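A minimal sketch of such a multi-task objective is given below, assuming PyTorch; the classifier, the label inventory size and the loss weight are illustrative assumptions rather than the configuration of [210].

```python
# A minimal sketch of multi-task training with an auxiliary linguistic
# classifier attached to an intermediate representation, as in text-supervised
# sequence-to-sequence VC. Dimensions and the loss weight are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxiliaryClassifier(nn.Module):
    def __init__(self, hidden_dim=256, num_phones=70):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_phones)

    def forward(self, hidden):            # hidden: (B, T, hidden_dim)
        return self.proj(hidden)          # per-frame phone logits

def multitask_loss(acoustic_pred, acoustic_target, phone_logits, phone_labels,
                   weight=0.1):
    # main task: acoustic feature regression; secondary task: phone prediction
    main = F.l1_loss(acoustic_pred, acoustic_target)
    aux = F.cross_entropy(phone_logits.transpose(1, 2), phone_labels)
    return main + weight * aux

# Toy usage with stand-in tensors (2 utterances, 50 frames):
logits = AuxiliaryClassifier()(torch.randn(2, 50, 256))
loss = multitask_loss(torch.randn(2, 50, 80), torch.randn(2, 50, 80),
                      logits, torch.randint(0, 70, (2, 50)))
```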
The neural representation of deep learning has facilitated the interaction between TTS and voice conversion. By leveraging TTS systems, we hope to improve the training and run-time inference of voice conversion by adhering to the linguistic content. However, such techniques usually require a large training corpus. Recent studies introduced frameworks for creating limited-data VC systems [209], [211], [212] by bootstrapping from a speaker-adaptive TTS model. How voice conversion can benefit from TTS systems without involving large training data deserves further study.
3) Leveraging ASR systems: Deep learning approaches for voice conversion typically require a large parallel corpus for training. This is partly because we would like to learn latent representations that describe the phonetic system. The requirement for training data has limited the scope of potential applications. We know that most ASR systems are already trained with a large corpus, and they already describe the phonetic system well in different ways. The question is how to leverage the latent representations in ASR systems for voice conversion.
One of the ideas is to use the context posterior probability sequence produced by an ASR model with sequence-to-sequence learning to generate a target speech feature sequence [160]. In this model, the system has an encoder-decoder structure similar to Figure 6, except that it uses a speech recognizer as the encoder and a speech synthesizer as the decoder. Another study guides a sequence-to-sequence voice conversion model with an ASR system, which augments the inputs with bottleneck features [179]. Recently, an end-to-end speech-to-speech sequence transducer, Parrotron [213], was studied. Parrotron learns to convert the speech spectrogram of any speaker, with multiple accents and imperfections, to the voice of a single predefined target speaker. Parrotron accomplishes this by using an auxiliary ASR decoder to predict the transcript of the output speech, conditioned on the encoder latent representation. The multi-task training of Parrotron optimizes the decoder to generate the target voice and, at the same time, constrains the latent representation to retain linguistic information only. The ASR decoder aims to disentangle the speaker's identity from the speech. The above techniques adopt the encoder-decoder with attention architecture.
Another way to look at voice conversion is that speech consists of two components, a speaker-dependent component and a speaker-independent component. If we are able to decompose speech signals into these two components, we can carry over the speaker-independent component and convert only the speaker-dependent component to achieve voice conversion. The average modeling technique represents one of the successful implementations [41], where we build a mapping function to convert phonetic posteriorgrams (PPGs) [32] to acoustic features. The PPG features are derived from an ASR system and can be considered speaker independent. We train the mapping function on multi-speaker, non-parallel speech data. In this way, one does not need to train a full conversion model for each target speaker. The average model can be adapted towards the target with a small amount of target speech. The training and adaptation of the average model are illustrated in Figure 9.
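The sketch below illustrates the core mapping of the average modeling approach, assuming PyTorch; the PPG dimension, the network and the adaptation recipe are illustrative assumptions rather than the exact setup of [41], [44].

```python
# A minimal sketch of the average modeling idea: a network maps
# speaker-independent PPG features (from an ASR system) to acoustic features
# and is trained on multi-speaker, non-parallel data. The PPG dimension,
# network size and fine-tuning recipe are illustrative assumptions.
import torch
import torch.nn as nn

class PPGToAcoustic(nn.Module):
    def __init__(self, ppg_dim=144, feat_dim=24, hidden_dim=256):
        super().__init__()
        self.blstm = nn.LSTM(ppg_dim, hidden_dim, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, feat_dim)

    def forward(self, ppg):                 # ppg: (B, T, ppg_dim)
        h, _ = self.blstm(ppg)
        return self.proj(h)                 # predicted acoustic features

# Step 1: train the average model on PPG/MCEP pairs pooled over many speakers.
# Step 2: adapt it to a new target speaker with a small amount of target
#         speech, e.g., by fine-tuning with a smaller learning rate.
model = PPGToAcoustic()
adapt_optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```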
There were several follow-up studies along this direction; for example, Tian et al. propose a PPG-to-waveform conversion [94], and an average model with speaker identity [155] as a condition [44]. Zhou et al. propose to use PPGs as the linguistic features for cross-lingual voice conversion [158]. Liu et al. propose to use PPGs for emotional voice conversion [214]. Zhang et al. also show that the average model framework can benefit from a small amount of parallel training data using an error reduction network [215].
4) Disentangling speaker from linguistic content: In the context of voice conversion, speech can be considered as a composition of speaker voice identity and linguistic content. If we are able to disentangle the speaker from the linguistic content, we can change the speaker identity independently of the linguistic content. The auto-encoder [216] represents one of the common techniques for speech disentanglement and reconstruction. There are other techniques, such as instance normalization [217] and vector quantization [218], [219], that are effective in disentangling the speaker from the content.
Fig. 10: A typical auto-encoding network for voice conversion, where the encoders and the decoder learn to disentangle speaker from linguistic content. At run-time, the latent code representing the linguistic content of the source speech is combined with the speaker embedding of a target speaker to generate the target speech.

An auto-encoder learns to reproduce its input as its output. Therefore, parallel training data are not required. An
encoder learns to represent the input with a latent code, and a decoder learns to reconstruct the original input from the latent code. The latent code can be seen as an information bottleneck which, on one hand, lets through the information necessary for reconstruction, e.g., the speaker-independent linguistic content, and, on the other hand, forces other information, e.g., speaker, noise and channel information, to be discarded [83]. The variational auto-encoder (VAE) [220] is the stochastic version of the auto-encoder, in which the encoder produces distributions over latent representations rather than deterministic latent codes, while the decoder is trained on samples from these distributions. The variational auto-encoder is more suitable than the deterministic auto-encoder for synthesizing new samples.
Chorowski et al. [98] provide a comparison of three auto-encoding neural networks by studying how they learn a representation from speech data that separates the speaker identity from the linguistic content. It was shown that the discrete representation, i.e., the latent code obtained from VQ-VAE, preserves the most linguistic content while also being the most speaker-invariant. Recently, a group latent embedding technique for VQ-VAE was studied to improve the encoding process; it divides the embedding dictionary into groups and uses the weighted average of the atoms in the nearest group as the latent embedding [221].
The concept of a VAE-based voice conversion framework [43] is illustrated in Figure 10. The decoder reconstructs the utterance by conditioning on the latent code extracted by the encoder, and separately on a speaker code, which could be a one-hot vector [43], [222] for a closed set of speakers, or an i-vector [155], a bottleneck speaker representation [223], or a d-vector [224] for an open set of speakers. By explicitly conditioning the decoder on the speaker identity, the encoder is forced to capture speaker-independent information in the latent code from a multi-speaker database.
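The sketch below illustrates this conditioning scheme with a plain (non-variational) auto-encoder and a speaker lookup table, assuming PyTorch; all dimensions are illustrative, and a VAE would additionally predict a mean and variance for the latent code and add a KL-divergence term to the reconstruction loss.

```python
# A minimal sketch of the auto-encoding framework in Figure 10: a content
# encoder produces a latent code, and the decoder is conditioned on a speaker
# embedding (here a simple lookup table for a closed speaker set). All
# dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class AutoEncoderVC(nn.Module):
    def __init__(self, feat_dim=80, latent_dim=64, num_speakers=10, spk_dim=32):
        super().__init__()
        self.content_encoder = nn.GRU(feat_dim, latent_dim, batch_first=True)
        self.speaker_table = nn.Embedding(num_speakers, spk_dim)
        self.decoder = nn.GRU(latent_dim + spk_dim, feat_dim, batch_first=True)

    def forward(self, x, speaker_id):
        code, _ = self.content_encoder(x)                    # (B, T, latent)
        spk = self.speaker_table(speaker_id)                 # (B, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, code.size(1), -1)  # broadcast over T
        out, _ = self.decoder(torch.cat([code, spk], dim=-1))
        return out

# Training reconstructs each speaker's own speech; at run time the source
# latent code is paired with the target speaker's embedding for conversion.
model = AutoEncoderVC()
x = torch.randn(2, 100, 80)
converted = model(x, speaker_id=torch.tensor([3, 3]))
```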
Just like other auto-encoders, the VAE decoder tends to generate over-smoothed speech. This can be problematic for voice conversion because the network may generate poor-quality, buzzy-sounding speech. Generative adversarial networks (GANs) [225] were proposed as one of the solutions to the over-smoothing problem. GANs offer a general framework for training a data generator in such a way that it can deceive a real/fake discriminator that attempts to distinguish real data from fake data produced by the generator. By incorporating the GAN concept into the VAE, VAE-GAN was studied for voice conversion with non-parallel training data [46] and for cross-lingual voice conversion [204]. It was shown that VAE-GAN [225] produces more natural-sounding speech than the standard VAE method [43], [223].
A recent study on sequence-to-sequence non-parallel
voice conversion [226] shows that it is possible to explicitly
model the transfer of other aspects of speech, such as
source rhythm, speaking style, and emotion to the target
speech.
VI. EVALUATION OF VOICE CONVERSION
Effective assessment of voice quality is required to validate algorithms, to measure technological progress, and to benchmark a system against the state-of-the-art. Typically, we report results in terms of objective and subjective measurements.

To provide an objective evaluation, a reference speech is required. Common objective evaluation metrics include Mel-cepstral distortion (MCD) [227] for the spectrum, and PCC [228] and RMSE [229]–[231] for prosody. We note that such metrics are not always correlated with human perception, partly because they measure the distortion of acoustic features rather than the waveform that humans actually listen to.
Subjective evaluation metrics, such as the mean opinion score (MOS) [2], [232]–[234], preference tests [18], [235], and best-worst scaling [236], can represent the intrinsic naturalness and the similarity to the target. We note that, for a subjective evaluation to be meaningful, a large number of listeners is required, which is not always possible in practice.
A. Objective Evaluation
1) Spectrum Conversion: To provide an objective evaluation, first of all, we need a reference utterance spoken by the target speaker. Ideally, the converted speech is very close to the reference speech. We can measure the differences between them by comparing their spectral distances. However, there is no guarantee that the converted speech and the reference speech are of the same length. In this case, a frame aligner is required to establish the frame-level mapping. Mel-cepstral distortion (MCD) [227] is commonly used to measure the difference between two spectral features [62], [237]–[239]. It is calculated between the converted and
target Mel-cepstral coefficients (MCEPs) [240], [241], $\hat{y}$ and $y$. Suppose that each MCEP vector consists of 24 coefficients; we have $\hat{y} = \{mc^{c}_{k,i}\}$ and $y = \{mc^{t}_{k,i}\}$ at frame $k$, where $i$ denotes the $i$-th coefficient of the converted and target MCEPs.

$MCD\,[dB] = \frac{10}{\ln 10}\sqrt{2\sum_{i=1}^{24}\left(mc^{t}_{k,i} - mc^{c}_{k,i}\right)^{2}}$ (17)
We note that a lower MCD indicates better performance. However, the MCD value is not always correlated with human perception. Therefore, subjective evaluations, such as MOS and similarity scores, are also conducted.
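A direct implementation of Eq. (17) is straightforward; the sketch below (Python/NumPy) computes the per-frame MCD and reports its average over frames, which is the usual way the single-frame definition above is aggregated over an utterance.

```python
# Eq. (17) applied to a pair of frame-aligned MCEP sequences
# (24 coefficients per frame), averaged over all frames.
import numpy as np

def mel_cepstral_distortion(mcep_converted, mcep_target):
    """mcep_*: arrays of shape (num_frames, 24), frame-aligned."""
    diff = mcep_target - mcep_converted
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))      # average MCD in dB over all frames

# Example with random stand-in features:
mcd = mel_cepstral_distortion(np.random.randn(200, 24), np.random.randn(200, 24))
```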
2) Prosody Conversion: The speech prosody of an utterance is characterized by its phonetic duration, energy contour, and pitch contour. To effectively measure how close the prosody patterns of the converted speech are to those of the reference speech, we need to provide measurements for all three aspects.

The alignment between the converted speech and the reference speech provides information about how much the phonetic durations differ from one another. We can derive the average number of frames that deviate from the ideal diagonal path, such as the frame disturbance [242], to report the differences in phonetic duration.
The Pearson Correlation Coefficient (PCC) [62], [205] and the Root Mean Squared Error (RMSE) have been widely used as evaluation metrics for the prosody contours or energy contours of two speech utterances; PCC measures their linear dependence, while RMSE measures their deviation.
We next take the measurement of two prosody contours as an example. The PCC between the aligned pair of converted and target F0 sequences is given as follows,

$\rho(F0^{c}, F0^{t}) = \frac{\mathrm{cov}(F0^{c}, F0^{t})}{\sigma_{F0^{c}}\,\sigma_{F0^{t}}}$ (18)

where $\sigma_{F0^{c}}$ and $\sigma_{F0^{t}}$ are the standard deviations of the converted F0 sequence ($F0^{c}$) and the target F0 sequence ($F0^{t}$), respectively. We note that a higher PCC value represents better F0 conversion performance.
The RMSE between the converted F0 and the corresponding target F0 is defined as,

$RMSE = \sqrt{\frac{1}{K}\sum_{k=1}^{K}\left(F0^{c}_{k} - F0^{t}_{k}\right)^{2}}$ (19)

where $F0^{c}_{k}$ and $F0^{t}_{k}$ denote the converted and target F0 features, respectively, and $K$ is the length of the F0 sequence, or the total number of frames. We note that a lower RMSE value represents better F0 conversion performance. The same measurement applies to energy contours as well.
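Both metrics are simple to compute once the two contours are time-aligned; the sketch below (Python/NumPy) implements Eqs. (18) and (19). In practice, they are often restricted to voiced frames.

```python
# PCC (Eq. 18) and RMSE (Eq. 19) between two time-aligned F0 contours.
import numpy as np

def f0_pcc(f0_converted, f0_target):
    # Pearson correlation between aligned F0 sequences
    return float(np.corrcoef(f0_converted, f0_target)[0, 1])

def f0_rmse(f0_converted, f0_target):
    return float(np.sqrt(np.mean((f0_converted - f0_target) ** 2)))

# Example with stand-in contours (in Hz):
f0_c = np.abs(np.random.randn(300)) * 20 + 150
f0_t = np.abs(np.random.randn(300)) * 20 + 160
print(f0_pcc(f0_c, f0_t), f0_rmse(f0_c, f0_t))
```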
Other generally-accepted metrics for prosody transfer include the F0 Frame Error (FFE) [243] and the Gross Pitch Error (GPE) [244]. We note that GPE reports the percentage of voiced frames whose pitch values differ by more than 20% from the reference, while FFE reports the percentage of frames that contain either a 20% pitch error or a voicing decision error [245].
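The sketch below (Python/NumPy) implements GPE and FFE as described above from aligned F0 contours and binary voicing decisions; treating GPE as a fraction of the frames voiced in both contours follows the common convention and is an assumption here.

```python
# GPE and FFE computed from aligned F0 contours and voicing decisions
# (1 = voiced, 0 = unvoiced), with a 20% relative pitch error threshold.
import numpy as np

def gross_pitch_error(f0_converted, f0_target, voiced_c, voiced_t):
    both_voiced = (voiced_c == 1) & (voiced_t == 1)
    pitch_err = np.abs(f0_converted - f0_target) > 0.2 * f0_target
    return float(np.sum(both_voiced & pitch_err) / max(np.sum(both_voiced), 1))

def f0_frame_error(f0_converted, f0_target, voiced_c, voiced_t):
    voicing_err = voiced_c != voiced_t
    both_voiced = (voiced_c == 1) & (voiced_t == 1)
    pitch_err = both_voiced & (np.abs(f0_converted - f0_target) > 0.2 * f0_target)
    return float(np.sum(voicing_err | pitch_err) / len(f0_target))

# Example with a short stand-in contour:
f0_t = np.array([100., 110., 0., 120.]); f0_c = np.array([102., 140., 0., 118.])
v_t = np.array([1, 1, 0, 1]); v_c = np.array([1, 1, 1, 1])
print(gross_pitch_error(f0_c, f0_t, v_c, v_t), f0_frame_error(f0_c, f0_t, v_c, v_t))
```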
B. Subjective Evaluation
Mean Opinion Score (MOS) has been widely used in listening tests [40], [61], [62], [246]–[251]. In MOS experiments, listeners rate the quality of the converted voice using a 5-point scale: “5” for excellent, “4” for good, “3” for fair, “2” for poor, and “1” for bad. There are several evaluation methods that are similar to MOS, for example: 1) DMOS [252]–[254], a “degradation” or “differential” MOS test, which requires listeners to rate a sample with respect to a reference, and 2) MUSHRA [255]–[257], which stands for MUltiple Stimuli with Hidden Reference and Anchor, and requires fewer participants than MOS to obtain statistically significant results.
Another popular subjective evaluation is the preference test, also denoted as the AB/ABX test [2], [11], [40], [258]. In AB tests, listeners are presented with two speech samples and asked to indicate which one has more of a certain property, for example naturalness or similarity. In an ABX test, similar to AB, two samples are given together with an extra reference sample X. Listeners need to judge whether A or B is more like X in terms of naturalness, similarity, or even emotional quality [205]. We note that it is not practical to use AB and/or ABX tests for the comparison of many VC systems at the same time. MUSHRA is another type of voice quality test used in telecommunication [259], where the reference natural speech and several converted samples of the same content are presented to the listeners in a random order. The listeners are asked to rate the speech quality of each sample between 0 and 100.
It is known that people are good at picking the extremes, but their preferences for anything in between might be fuzzy and inaccurate when presented with a long list of options. Best-Worst Scaling (BWS) [236] has been proposed for voice conversion quality assessment [22], where listeners are presented with only a few randomly selected options each time. With many such BWS decisions, Best-Worst Scaling can handle a long list of options and generates more discriminating results, such as a voice quality ranking, than MOS and preference tests.
We note that subjective measures can represent the intrinsic naturalness and similarity of a voice conversion system. However, such evaluations can be time-consuming and expensive as they involve a large number of listeners.
C. Evaluation with Deep Learning Approaches
The study of perceptual quality evaluation seeks to approximate human judgement with computational models of psychoacoustic motivation. It provides insights into how humans perceive speech quality in listening tests, and suggests assessment metrics that are required in speech communication, speech enhancement, speech synthesis, voice conversion, and any other speech production or transmission application. Perceptual Evaluation of Speech Quality (PESQ) [260] is an ITU-T recommendation that is widely used as an industry standard. It provides an objective speech quality evaluation that predicts the human-perceived speech quality.
However, the PESQ formulation requires the presence of reference speech, which considerably restricts its use in voice conversion applications and motivates the study of perceptual evaluations that do not need reference speech. Metrics that do not require reference speech are called non-intrusive evaluation metrics. For example, Fu et al. [261] propose Quality-Net, an end-to-end model that predicts PESQ ratings, which serve as a proxy for human ratings. Yoshimura et al. [262] and Patton et al. [263] propose CNN-based naturalness predictors that predict human MOS ratings, among other non-intrusive assessment metrics [264]–[266].
Lo et al. [267] propose MOSNet, another non-intrusive assessment technique based on deep neural networks, which learns to predict human MOS ratings. MOSNet scores are highly correlated with human MOS ratings at the system level, and fairly correlated at the utterance level. While it is a non-intrusive evaluation metric for naturalness, MOSNet can also be modified and re-purposed to predict the similarity scores between target speech and converted speech. It provides similarity scores with fair correlation to human ratings on the VCC 2018 dataset. MOSNet, which is free and open-source, marks a recent advancement towards automatic perceptual quality evaluation [268].
VII. VOICE CONVERSION CHALLENGES
In this section, we give an overview of the series of voice conversion challenges, which provide shared tasks with common data sets and evaluation metrics for a fair comparison of algorithms. The voice conversion challenge (VCC) has been held every two years since 2016. In each challenge, a common database is provided by the organizers. The participants build voice conversion systems using their own technology, and the organizers evaluate the performance of the converted speech. The main evaluation methodology is a listening test in which crowd-sourced evaluators rate the naturalness and speaker similarity.
The 2016 challenge offered a standard voice conversion task using a parallel training database [269]. The 2018 challenge featured a more advanced conversion scenario using a non-parallel database [270]. The 2020 challenge puts forward a cross-lingual voice conversion research problem. A summary of VCC 2016, VCC 2018 and VCC 2020 is provided in Table I.
A. Why is the Challenge Needed?
As described earlier, many voice conversion approaches are data-driven, hence speech data are required to train the models and to evaluate the conversion. To compare such data-driven methods with each other precisely, a common database that explicitly specifies the training and evaluation data is needed. However, such a common database did not exist until 2016. Without common databases, researchers had to re-implement others' systems on their own databases before trying any new ideas. In such a situation, it is not guaranteed that the re-implemented system achieves the performance expected from the original work.
To address the same problem, the TTS community gave birth to the first Blizzard Challenge in 2005. Since then, the challenge has defined various standard databases for TTS and has made comparisons of TTS systems much fairer and easier. The motivations of the VCC are exactly the same as those of the Blizzard Challenges. The VCC introduced several standard databases for voice conversion and also defined common training and evaluation protocols. All the converted speech submitted by the participants of the challenges has been released publicly. In this way, researchers can compare the performance of their voice conversion system with that of other state-of-the-art systems without the need for re-implementation.
Another need for standard voice conversion databases arose from the biometric speaker recognition community. As voice conversion technology could be misused for attacking speaker verification systems, anti-spoofing countermeasures are required [271]. This is also called presentation attack detection. Anti-spoofing techniques aim at discriminating between fake, artificial inputs presented to biometric authentication systems and genuine inputs. If sufficient knowledge and data regarding the spoofed data are available, a binary classifier can be constructed to reject artificial inputs. Therefore, the common VCC databases are also important for anti-spoofing research. With a large amount of converted speech data from advanced voice conversion systems, researchers in the biometric community can develop anti-spoofing models to strengthen the defence of speaker recognition systems, and to evaluate their vulnerabilities.
B. Overview of the 2016 Voice Conversion Challenge
We first give an overview of the 2016 voice conversion challenge [269] and its datasets¹. As the first shared task in voice conversion, a parallel voice conversion task and its evaluation protocol were defined for VCC 2016. The parallel dataset consists of 162 common sentences uttered by both the source and target speakers. The target and source speakers are each four native speakers of American English (two females and two males). In the challenge, the participants develop conversion systems and produce converted speech for all possible source-target pair combinations. In total, eight speakers (plus two unused speakers) are included in the VCC 2016 database. The number of test sentences for evaluation is 54.
The main evaluation methodology adopted for the ranking is a subjective evaluation of the perceived naturalness and the speaker similarity of the converted samples to the target speakers. Naturalness is evaluated using the standard five-point mean opinion score (MOS) test, ranging from 1 (completely unnatural) to 5 (completely natural). Speaker similarity is evaluated using the Same/Different paradigm [272]. Subjects are asked to listen to two audio samples and to judge whether they are speech signals produced by the same speaker on a four-point scale: “Same, absolutely sure”, “Same, not sure”, “Different, not sure” and “Different, absolutely sure”. As the perceived speaker similarity to a target speaker and the perceived voice quality are not necessarily correlated, it is important to use a scatter plot to observe the trade-off between the two aspects.
¹ The VCC2016 dataset is available at https://doi.org/10.7488/ds/1575
Challenge | Language | Task | Training Data | # Speakers | Testing Data
VCC 2016 | monolingual | parallel | 162 paired utterances | 4 source, 4 target | 54 utterances
VCC 2018 | monolingual | parallel | 81 paired utterances | 4 source, 4 target | 35 utterances
VCC 2018 | monolingual | nonparallel | 81 unpaired utterances | 4 source, 4 target | 35 utterances
VCC 2020 | monolingual | parallel + nonparallel | 20 paired, 50 unpaired utterances | 4 source, 4 target | 25 utterances
VCC 2020 | crosslingual | nonparallel | 70 unpaired utterances | 4 source, 6 target | 25 utterances
TABLE I: Summary of VCC 2016, VCC 2018 and VCC 2020.
In the 2016 challenge, 17 participants submitted their conversion results. Two hundred native listeners of English joined the listening tests. It was reported that the best system, using GMM and waveform filtering, obtained an average of 3.0 on the five-point scale for the naturalness judgement, and about 70% of its converted speech samples were judged by listeners to be the same as the target speakers. However, it was also confirmed that there was still a huge gap between target natural speech and the converted speech. We observe that achieving both good quality and high speaker similarity remained an unsolved challenge at that time. More details of VCC 2016 can be found in [272]. Details of the best performing systems are reported in [273].
C. Overview of the 2018 Voice Conversion Challenge
Next, we give an overview of the 2018 voice conversion challenge [270] and its datasets². VCC 2018 offers two tasks, a parallel and a non-parallel voice conversion task. A dataset and its evaluation protocol are defined for each task. The dataset for the parallel conversion task is similar to that of the 2016 challenge, except that it has a smaller number of common utterances uttered by the source and target speakers. The target and source speakers are each four native speakers of American English (two females and two males), but they are different speakers from those used in the 2016 challenge. Like in the 2016 challenge, the participants were asked to develop conversion systems and to produce converted data for all possible source-target pair combinations.

VCC 2018 introduced a non-parallel voice conversion task for the first time. The same target speakers' data as in the parallel task are used as the target. However, the source speakers are four native speakers of American English (2 females and 2 males) different from those of the parallel conversion task, and their utterances are also all different from those of the target speakers. Like in the parallel voice conversion task, converted data for all possible source-target pair combinations needed to be produced by the participants. In total, twelve speakers are included in the VCC 2018 database. Each of the source and target speakers has a set of 81 sentences as training data, which is half of that for VCC 2016. The number of test sentences for evaluation is 35.
In the 2018 challenge, 23 participants submitted their conversion results to the parallel conversion task, with 11 of them additionally participating in the non-parallel conversion task. The same evaluation methodology as in the 2016 challenge was adopted, and 260 crowd-sourced native listeners of English joined the listening tests. It was reported that, in both tasks, the best system, using a phone encoder and a neural vocoder, obtained an average of 4.1 on the five-point scale for the naturalness judgement, and about 80% of its converted speech samples were judged by listeners to be the same as the target speakers. It was also reported that the best system has similar performance in both the parallel and non-parallel tasks, in contrast to results reported in the literature.

² The VCC2018 dataset is available at https://doi.org/10.7488/ds/2337
In VCC 2018, a spoofing countermeasure was introduced as a supplement to the subjective evaluation of voice quality, which brought together the voice conversion and speaker verification research communities. More details of the 2018 challenge can be found in [270]. Details of the best performing systems are reported in [274], [275].

From this challenge, we observed that new speech waveform generation paradigms, such as WaveNet and phone encoding, have brought significant progress to the voice conversion field. Further improvements have been achieved in follow-up papers [276], [277], and new VC systems that exceed the challenge's best performance have already been reported.
D. Overview of the 2020 Voice Conversion Challenge
The 2020 voice conversion challenge³ consists of two tasks: 1) non-parallel training in the same language (English); and 2) non-parallel training across different languages (English-Finnish, English-German, and English-Mandarin). In the first task, each participant trains voice conversion models for all source and target speaker pairs using up to 70 utterances per speaker as training data, including 20 parallel utterances and 50 non-parallel utterances in English. Overall, 16 voice conversion models (i.e., 4 sources by 4 targets) are to be developed. In the second task, each participant develops voice conversion models for all source and target speaker pairs using up to 70 utterances for each speaker (i.e., in English for the source speakers, and in Finnish, German, or Mandarin for the target speakers) as training data. Overall, 24 conversion systems (i.e., 4 sources by 6 targets) are to be developed.

In the 2020 challenge, the participants are allowed to mix and combine different source speakers' data to train speaker-independent models. Moreover, the participants can also use orthographic transcriptions of the released training data to develop their voice conversion systems. Last but not least, the participants are free to perform manual annotations of the released training data, which can effectively improve the quality of the converted speech.
³ The 2020 VCC whitepaper: http://www.vc-challenge.org/rules.html
The 2020 challenge organizers also built several baseline systems, including the top system of the previous challenge, on the new database. The code of the CycleVAE-based baseline⁴ and of the cascaded ASR + TTS based VC⁵ has been released, so that participants can build basic systems easily and focus on their own innovations. The 2020 challenge also features a multifaceted evaluation. In addition to the traditional evaluation metrics, the challenge also reports speech recognition, speaker recognition, and anti-spoofing evaluation results on the converted speech. The challenge is underway at the time we submit this manuscript.
E. Relevant Challenges: The ASVspoof Challenge
Spoofing of automatic speaker verification is a topic related to voice conversion, and it has also been organized into technology challenges. The ASVspoof series of challenges are such biennial events, which started in 2013. As in the voice conversion challenges, the organizers release a common database, including many pairs of spoofed audio (converted, generated, or replayed audio) and genuine audio, to the participants, who build anti-spoofing models using their own technology. The organizers rank the detection accuracy of the anti-spoofing results submitted by the participants.
In 2015, the first anti-spoofing database, including various types of spoofed audio generated by voice conversion and TTS systems, was constructed. This database became a reference standard in the automatic speaker verification (ASV) community [278], [279]. The main focus of the 2017 challenge was a replay task, for which a large quantity of real-world replayed speech data was collected [280]. In 2019, an even larger database including converted, generated, and replayed speech data was constructed [281]. The best performing systems in the 2016 and 2018 voice conversion challenges were also used for generating advanced spoofed audio [282]. The challenges revealed that some anti-spoofing systems outperform human listeners in detecting spoofed audio.
VIII. RESOURCES
In addition to the voice conversion challenge databases described above, the CMU-Arctic database [283] and the VCTK database [284] are also popular for voice conversion research. The current version of the CMU-Arctic database⁶ has 18 English speakers, each of whom reads out the same set of around 1,150 utterances, which are carefully selected from out-of-copyright texts from Project Gutenberg. This is suitable for parallel voice conversion since the sentences are common to all the speakers. The current version (ver. 0.92) of the CSTR VCTK corpus⁷ has speech data uttered by 110 English speakers with various dialects. Each speaker reads out about 400 sentences, which are selected from newspapers, the rainbow passage and an elicitation paragraph used for the speech accent archive.
⁴ https://github.com/bigpon/vcc20_baseline_cyclevae
⁵ https://github.com/espnet/espnet/tree/master/egs/vcc20
⁶ http://www.festvox.org/cmu_arctic/
⁷ https://doi.org/10.7488/ds/2645
Since the rainbow passage and an elicitation paragraph are
common to all the speakers, this database can be used for
both parallel and non-parallel voice conversion.
Since neural networks are data hungry and generalization to unseen speakers is key for successful conversion, large-scale but lower-quality databases such as LibriTTS and VoxCeleb are also used for training some of the components required for voice conversion (e.g., the speaker encoder). The LibriTTS corpus [285] has 585 hours of transcribed speech data uttered by a total of 2,456 speakers. The recording conditions and audio quality are less than ideal, but this corpus is suitable for training speaker encoder networks or generalizing any-to-any speaker mapping networks. The VoxCeleb database [286] is an even larger-scale speech database consisting of about 2,800 hours of untranscribed speech from over 6,000 speakers. It is an appropriate database for training noise-robust speaker encoder networks.
There are many open-source codes for training VC models. For instance, sprocket [287] supports GMM-based conversion, and ESPnet [288] supports a cascaded ASR and TTS system. In addition, there are many open-source implementations of neural-network based voice conversion written by the community on GitHub⁸.
IX. CONCLUSION
This article provides a comprehensive overview of voice conversion technology, covering the fundamentals and practice up to July 2020. We reveal the underlying technologies and their relationships, from statistical approaches to deep learning, and discuss their promise and limitations. We also study the evaluation techniques for voice conversion. Moreover, we report on the series of voice conversion challenges and on resources that are useful for researchers and engineers starting voice conversion research.
REFERENCES
[1] John Q. Stewart, “An electrical analogue of the vocal organs,” Nature, vol. 110, pp. 311–312, 1922.
[2] Alexander Kain and Michael W Macon, “Spectral voice conversion
for text-to-speech synthesis,” in Proceedings of the 1998 IEEE
International Conference on Acoustics, Speech and Signal Processing,
ICASSP’98 (Cat. No. 98CH36181). IEEE, 1998, vol. 1, pp. 285–288.
[3] Mingyang Zhang, Xin Wang, Fuming Fang, Haizhou Li, and Junichi
Yamagishi, “Joint training framework for text-to-speech and voice
conversion using multi-source tacotron and wavenet,” arXiv preprint
arXiv:1903.12389, 2019.
[4] Christophe Veaux, Junichi Yamagishi, and Simon King, “Towards personalised synthesised voices for individuals with vocal disabilities: Voice banking and reconstruction,” 08 2013.
[5] Brij Srivastava, Nathalie Vauquier, Md Sahidullah, Aurélien Bel-
let, Marc Tommasi, and Emmanuel Vincent, “Evaluating voice
conversion-based privacy protection against informed attackers,” 11
2019.
[6] Zhizheng Wu and Haizhou Li, “Voice conversion versus speaker verification: an overview,” APSIPA Transactions on Signal and Information Processing, vol. 3, pp. e17, 2014.
[7] Chien-yu Huang, Yist Y. Lin, Hung-yi Lee, and Lin-shan Lee, “Defending your voice: Adversarial attack on voice conversion,” ArXiv, vol. abs/2005.08781, 2020.
⁸ https://paperswithcode.com/task/voice-conversion
[8] Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao
Kuwabara, “Voice conversion through vector quantization,” Journal
of the Acoustical Society of Japan (E), vol. 11, no. 2, pp. 71–76, 1990.
[9] Kiyohiro Shikano, Satoshi Nakamura, and Masanobu Abe, “Speaker
Adaptation and Voice Conversion by Codebook Mapping,” IEEE
International Sympoisum on Circuits and Systems, pp. 594–597, 1991.
[10] Elina Helander, Jan Schwarz, Jani Nurminen, Hanna Silen, and
Moncef Gabbouj, “On the impact of alignment on voice conversion
performance,” in Ninth Annual Conference of the International
Speech Communication Association, 2008.
[11] Tomoki Toda, Alan W. Black, and Keiichi Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 8, pp. 2222–2235, 2007.
[12] Heiga Zen, Yoshihiko Nankaku, and Keiichi Tokuda, “Probabilistic
feature mapping based on trajectory hmms,” in Ninth Annual
Conference of the International Speech Communication Association,
2008.
[13] Kazuhiro Kobayashi, Shinnosuke Takamichi, Satoshi Nakamura, and
Tomoki Toda, “The NU-NAIST voice conversion system for the Voice
Conversion Challenge 2016,” in INTERSPEECH, 2016.
[14] Elina Helander, Tuomas Virtanen, Jani Nurminen, and Moncef
Gabbouj, “Voice conversion using partial least squares regression,”
IEEE Transactions on Audio, Speech, and Language Processing, vol.
18, no. 5, pp. 912–921, 2010.
[15] Elina Helander, Hanna Silén, Tuomas Virtanen, and Moncef Gabbouj, “Voice conversion using dynamic kernel partial least squares regression,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 3, pp. 806–817, 2011.
[16] Yi Luan, Daisuke Saito, Yosuke Kashiwagi, Nobuaki Minematsu, and
Keikichi Hirose, “Semi-supervised noise dictionary adaptation for
exemplar-based noise robust speech recognition,” in 2014 IEEE
international conference on acoustics, speech and signal processing
(ICASSP). IEEE, 2014, pp. 1745–1748.
[17] Ryoichi Takashima, Tetsuya Takiguchi, and Yasuo Ariki, “Exemplar-
based voice conversion in noisy environment,” In IEEE SLT, pp.
313–317, 2012.
[18] Zhizheng Wu, Tuomas Virtanen, Eng Siong Chng, and Haizhou Li,
“Exemplar-based sparse representation with residual compensation
for voice conversion,” IEEE/ACM Transactions on Audio, Speech and
Language Processing, vol. 22, no. 10, pp. 1506–1521, 2014.
[19] Ryo Aihara, Kenta Masaka, Tetsuya Takiguchi, and Yasuo Ariki,
“Parallel dictionary learning for multimodal voice conversion using
matrix factorization,” In INTERSPEECH, pp. 27–40, 2016.
[20] Zeyu Jin, Adam Finkelstein, Stephen DiVerdi, Jingwan Lu, and Gau-
tham J Mysore, “Cute: A concatenative method for voice conversion
using exemplar-based unit selection,” in 2016 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2016, pp. 5660–5664.
[21] Ryo Aihara, Toru Nakashika, Tetsuya Takiguchi, and Yasuo Ariki, “Voice conversion based on non-negative matrix factorization using phoneme-categorized dictionary,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 7894–7898.
[22] Berrak Sisman, Haizhou Li, and Kay Chen Tan, “Sparse represen-
tation of phonetic features for voice conversion with and without
parallel data,” in 2017 IEEE Automatic Speech Recognition and
Understanding Workshop (ASRU). IEEE, 2017, pp. 677–684.
[23] Mikiko Mashimo, Tomoki Toda, Hiromichi Kawanami, Kiyohiro
Shikano, and Nick Campbell, “Cross-language voice conversion
evaluation using bilingual databases,” IPSJ Journal, 2002.
[24] David Sundermann, Harald Hoge, Antonio Bonafonte, Hermann
Ney, Alan Black, and Shri Narayanan, “Text-independent voice
conversion based on unit selection,” in 2006 IEEE International
Conference on Acoustics Speech and Signal Processing Proceedings.
IEEE, 2006, vol. 1, pp. I–I.
[25] Hao Wang, Frank Soong, and Helen Meng, “A spectral space warping
approach to cross-lingual voice transformation in hmm-based tts,”
in 2015 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP). IEEE, 2015, pp. 4874–4878.
[26] David Sundermann, Hermann Ney, and H Hoge, “Vtln-based
crosslanguage voice conversion,” IEEE ASRU, 2003.
[27] D. Erro, A. Moreno, and A. Bonafonte, “Inca algorithm for training
voice conversion systems from nonparallel corpora,” IEEE Transac-
tions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp.
944–953, 2010.
[28] Daniel Erro and Asuncion Moreno, “Frame alignment method for
cross-lingual voice conversion,” INTERSPEECH, 1972.
[29] Jianhua Tao, Meng Zhang, Jani Nurminen, Jilei Tian, and Xia Wang,
“Supervisory data alignment for text-independent voice conversion,”
IEEE Transactions on Audio, Speech, and Language Processing, vol.
18, no. 5, pp. 932–943, 2010.
[30] Hanna Silen, Jani Nurminen, Elina Helander, and Moncef Gabbouj,
“Voice conversion for non-parallel datasets using dynamic kernel
partial least squares regression,” IEEE Transactions on Audio, Speech,
and Language Processing, vol. 20, no. 3, pp. 806–817, 2012.
[31] Peng Song, Yun Jin, Wenming Zheng, and Li Zhao, “Text-
independent voice conversion using speaker model alignment
method from non-parallel speech,” In Proceedings of the Annual
Conference of the International Speech Communication Association,
INTERSPEECH, , no. September, pp. 2308–2312, 2014.
[32] Lifa Sun, Kun Li, Hao Wang, Shiyin Kang, and Helen Meng, “Phonetic
posteriorgrams for many-to-one voice conversion without parallel
data training,” in 2016 IEEE International Conference on Multimedia
and Expo (ICME). IEEE, 2016, pp. 1–6.
[33] Timothy J Hazen, Wade Shen, and Christopher White, “Query-
by-example spoken term detection using phonetic posteriorgram
templates,” In IEEE ASRU, pp. 421–426, 2009.
[34] Keith Kintzley, Aren Jansen, and Hynek Hermansky, “Event selection
from phone posteriorgrams using matched filters,” In INTER-
SPEECH, pp. 1905–1908, 2011.
[35] Seyed Hamidreza Mohammadi and Alexander Kain, “An overview of voice conversion systems,” Speech Communication, vol. 88, pp. 65–82, 2017.
[36] M Narendranath, Hema A Murthy, S Rajendran, and B Yegna-
narayana, “Transformation of formants for voice conversion using
artificial neural networks,” Speech communication, vol. 16, no. 2,
pp. 207–216, 1995.
[37] Kurt Hornik, Maxwell Stinchcombe, and Halbert White, “Multi-
layer feedforward networks are universal approximators, Neural
networks, vol. 2, no. 5, pp. 359–366, 1989.
[38] Rabul Hussain Laskar, D Chakrabarty, Fazal Ahmed Talukdar,
K Sreenivasa Rao, and Kalyan Banerjee, “Comparing ann and gmm
in a voice conversion framework,” Applied Soft Computing, vol. 12,
no. 11, pp. 3332–3342, 2012.
[39] Hy Quy Nguyen, Siu Wa Lee, Xiaohai Tian, Minghui Dong, and Eng Siong Chng, “High quality voice conversion using prosodic and high-resolution spectral features,” Multimedia Tools and Applications, vol. 75, no. 9, pp. 5265–5285, 2016.
[40] Lifa Sun, Shiyin Kang, Kun Li, and Helen Meng, “Voice conversion
using deep bidirectional long short-term memory based recurrent
neural networks,” in 2015 IEEE international conference on acoustics,
speech and signal processing (ICASSP). IEEE, 2015, pp. 4869–4873.
[41] Jie Wu, Zhizheng Wu, and Lei Xie, “On the use of I-vectors and
average voice model for voice conversion without parallel data,”
In IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2016.
[42] Feng Long Xie, Frank K. Soong, and Haifeng Li, “A KL divergence
and DNN-based approach to voice conversion without parallel
training sentences,” In Proceedings of the Annual Conference of the
International Speech Communication Association, INTERSPEECH,
pp. 287–291, 2016.
[43] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-
Min Wang, “Voice conversion from non-parallel corpora using vari-
ational auto-encoder, in 2016 Asia-Pacific Signal and Information
Processing Association Annual Summit and Conference (APSIPA).
IEEE, 2016, pp. 1–6.
[44] Xiaohai Tian, Junchao Wang, Haihua Xu, Eng Siong Chng, and
Haizhou Li, “Average Modeling Approach to Voice Conversion
with Non-Parallel Data,” in Odyssey 2018 The Speaker and Language
Recognition Workshop, pp. 1–10, 2018.
[45] Lifa Sun, Hao Wang, Shiyin Kang, Kun Li, and Helen Meng, “Per-
sonalized, cross-lingual TTS using phonetic posteriorgrams,” In
INTERSPEECH, pp. 322–326, 2016.
[46] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-
Min Wang, “Voice Conversion from Unaligned Corpora using Varia-
tional Autoencoding Wasserstein Generative Adversarial Networks,”
arXiv:1704.00849 [cs.CL], 2017.
[47] Takuhiro Kaneko and Hirokazu Kameoka, “Parallel-data-free voice
conversion using cycle-consistent adversarial networks,” arXiv
preprint arXiv:1711.11293, 2017.
[48] Fuming Fang, Junichi Yamagishi, Isao Echizen, and Jaime Lorenzo-
Trueba, “High-quality nonparallel voice conversion based on cycle-
consistent adversarial network,” in 2018 IEEE International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP). IEEE,
2018, pp. 5279–5283.
[49] Jaime Lorenzo-Trueba, Fuming Fang, Xin Wang, Isao Echizen, Ju-
nichi Yamagishi, and Tomi Kinnunen, “Can we steal your vocal iden-
tity from the Internet?: Initial investigation of cloning Obama’s voice
using GAN, WaveNet and low-quality found data,” arXiv:1803.00860
[eess.AS], 2018.
[50] Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu
Hojo, “StarGAN-VC: Non-parallel many-to-many voice conversion
with star generative adversarial networks,” arXiv:1806.02169 [cs.SD],
2018.
[51] Manu Airaksinen, Lauri Juvela, Bajibabu Bollepalli, Junichi Yam-
agishi, and Paavo Alku, “A comparison between straight, glottal,
and sinusoidal vocoding in statistical parametric speech synthesis,”
IEEE/ACM Transactions on Audio, Speech, and Language Processing,
vol. 26, no. 9, pp. 1658–1670, 2018.
[52] Xin Wang, Jaime Lorenzo-Trueba, Shinji Takaki, Lauri Juvela, and
Junichi Yamagishi, A comparison of recent waveform generation
and acoustic modeling methods for neural-network-based speech
synthesis,” in Proceedings of the IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP), Calgary, Canada,
April 2018, pp. 4804–4808.
[53] Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan,
Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and
Koray Kavukcuoglu, “Wavenet: A generative model for raw audio,”
arXiv preprint arXiv:1609.03499, 2016.
[54] Akira Tamamori, Tomoki Hayashi, Kazuhiro Kobayashi, Kazuya
Takeda, and Tomoki Toda, “Speaker-dependent wavenet vocoder.,”
in Interspeech, 2017, vol. 2017, pp. 1118–1122.
[55] Tomoki Hayashi, Akira Tamamori, Kazuhiro Kobayashi, Kazuya
Takeda, and Tomoki Toda, “An investigation of multi-speaker
training for wavenet vocoder, in 2017 IEEE Automatic Speech
Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp.
712–718.
[56] Yi-Chiao Wu, Tomoki Hayashi, Patrick Lumban Tobing, Kazuhiro
Kobayashi, and Tomoki Toda, “Quasi-periodic wavenet vocoder: a
pitch dependent dilated convolution model for parametric speech
generation,” arXiv preprint arXiv:1907.00797, 2019.
[57] Yi-Chiao Wu, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro
Kobayashi, and Tomoki Toda, “Statistical voice conversion with
quasi-periodic wavenet vocoder, arXiv preprint arXiv:1907.08940,
2019.
[58] Berrak Sisman, Mingyang Zhang, Sakriani Sakti, Haizhou Li, and
Satoshi Nakamura, Adaptive wavenet vocoder for residual com-
pensation in gan-based voice conversion,” in 2018 IEEE Spoken
Language Technology Workshop (SLT). IEEE, 2018, pp. 282–289.
[59] H. Du, X. Tian, L. Xie, and H. Li, “Wavenet factorization with singular
value decomposition for voice conversion,” in 2019 IEEE Automatic
Speech Recognition and Understanding Workshop (ASRU), 2019, pp.
152–159.
[60] Wen-Chin Huang, Yi-Chiao Wu, Hsin-Te Hwang, Patrick Lumban
Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda, Yu Tsao,
and Hsin-Min Wang, “Refined wavenet vocoder for variational
autoencoder based voice conversion,” in 2019 27th European Signal
Processing Conference (EUSIPCO). IEEE, 2019, pp. 1–5.
[61] Berrak Sisman, Mingyang Zhang, and Haizhou Li, “A voice con-
version framework with tandem feature sparse representation and
speaker-adapted wavenet vocoder., in Interspeech, 2018, pp. 1978–
1982.
[62] Berrak Sisman, Mingyang Zhang, and Haizhou Li, “Group Sparse
Representation with WaveNet Vocoder Adaptation for Spectrum and
Prosody Conversion,” IEEE/ACM Transactions on Audio, Speech and
Language Processing, 2019.
[63] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman
Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den
Oord, Sander Dieleman, and Koray Kavukcuoglu, “Efficient neural
audio synthesis,” arXiv preprint arXiv:1802.08435, 2018.
[64] Ryan Prenger, Raffael Valle, and Bryan Catanzaro, “WaveGlow: A
Flow-based Generative Network for Speech Synthesis,” in Proceed-
ings of the IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), Brighton, UK, May 2019, pp. 3617–3621.
[65] Tomoki Toda, Ling-Hui Chen, Daisuke Saito, Fernando Villavicencio,
Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi, “The voice
conversion challenge 2016.,” in Interspeech, 2016, pp. 1632–1636.
[66] Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi, “Multidimen-
sional scaling of systems in the voice conversion challenge 2016.,”
in SSW, 2016, pp. 38–43.
[67] Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi, “Analysis of the
voice conversion challenge 2016 evaluation results,” in Interspeech,
2016, pp. 1637–1641.
[68] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke
Saito, Fernando Villavicencio, Tomi Kinnunen, and Zhenhua Ling,
“The voice conversion challenge 2018: Promoting development of
parallel and nonparallel methods,” arXiv preprint arXiv:1804.04262,
2018.
[69] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke
Saito, Fernando Villavicencio, Tomi Kinnunen, Zhenhua Ling, et al.,
“The voice conversion challenge 2018: database and results,” 2018.
[70] Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro
Kobayashi, and Tomoki Toda, “Nu voice conversion system for the
voice conversion challenge 2018.,” in Odyssey, 2018, pp. 219–226.
[71] Daniel Griffin and Jae Lim, “Signal estimation from modified short-
time fourier transform,” IEEE Transactions on Acoustics, Speech, and
Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.
[72] Eric Moulines and Francis Charpentier, “Pitch-synchronous wave-
form processing techniques for text-to-speech synthesis using di-
phones,” Speech communication, vol. 9, no. 5-6, pp. 453–467, 1990.
[73] Hélene Valbret, Eric Moulines, and Jean-Pierre Tubach, “Voice
transformation using psola technique,” Speech communication, vol.
11, no. 2-3, pp. 175–187, 1992.
[74] Levent M Arslan, “Speaker transformation algorithm using segmen-
tal codebooks (stasc),” Speech Communication, vol. 28, no. 3, pp.
211–226, 1999.
[75] Yannis Stylianou, Applying the harmonic plus noise model in
concatenative speech synthesis,” IEEE Transactions on speech and
audio processing, vol. 9, no. 1, pp. 21–29, 2001.
[76] Yannis Stylianou and Olivier Cappe, “A system for voice conversion
based on probabilistic classification and a harmonic plus noise
model,” in Proceedings of the 1998 IEEE International Conference
on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No.
98CH36181). IEEE, 1998, vol. 1, pp. 281–284.
[77] Daniel Erro and Asunción Moreno, “Weighted frequency warping for
voice conversion,” in Eighth Annual Conference of the International
Speech Communication Association, 2007.
[78] Satoshi Imai, Kazuo Sumita, and Chieko Furuichi, “Mel log spectrum
approximation (mlsa) filter for speech synthesis,”
Electronics and Communications in Japan (Part I: Communications),
vol. 66, no. 2, pp. 10–18, 1983.
[79] M. Airaksinen, L. Juvela, B. Bollepalli, J. Yamagishi, and P. Alku, A
comparison between straight, glottal, and sinusoidal vocoding in
statistical parametric speech synthesis,” IEEE/ACM Transactions on
Audio, Speech, and Language Processing, vol. 26, no. 9, pp. 1658–
1670, 2018.
[80] Hideki Kawahara, Ikuyo Masuda-Katsuse, and Alain De Cheveigne,
“Restructuring speech representations using a pitch-adaptive time–
frequency smoothing and an instantaneous-frequency-based f0 ex-
traction: Possible role of a repetitive structure in sounds, Speech
communication, vol. 27, no. 3-4, pp. 187–207, 1999.
[81] Srinivas Desai, E Veera Raghavendra, B Yegnanarayana, Alan W
Black, and Kishore Prahallad, “Voice conversion using artificial neu-
ral networks,” in 2009 IEEE International Conference on Acoustics,
Speech and Signal Processing. IEEE, 2009, pp. 3893–3896.
[82] Berrak Sisman and Haizhou Li, Wavelet analysis of speaker
dependent and independent prosody for voice conversion.,” in
Interspeech, 2018, pp. 52–56.
[83] Wei-Ning Hsu, Yu Zhang, and James Glass, “Unsupervised learning
of disentangled and interpretable representations from sequential
data,” in Advances in neural information processing systems, 2017,
pp. 1878–1889.
[84] Wei-Ning Hsu, Yu Zhang, and James Glass, “Learning latent repre-
sentations for speech generation and transformation,” arXiv preprint
arXiv:1704.04222, 2017.
[85] Sadaoki Furui, “Digital speech processing, synthesis, and recogni-
tion(revised and expanded),” Digital Speech Processing, Synthesis,
and Recognition, 2000.
[86] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep
Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang,
RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and
Yonghui Wu, “Natural tts synthesis by conditioning wavenet on mel
spectrogram predictions,” arXiv:1712.05884, 2018.
[87] Rui Liu, Berrak Sisman, Jingdong Li, Feilong Bao, Guanglai Gao, and
Haizhou Li, “Teacher-student training for robust tacotron-based tts,”
arXiv preprint arXiv:1911.02839, 2019.
[88] Zdeněk Hanzlíček, Jakub Vít, and Daniel Tihelka, “Wavenet-based
speech synthesis applied to czech,” in International Conference on
Text, Speech, and Dialogue. Springer, 2018, pp. 445–452.
[89] Sercan Ö Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos,
Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng,
Jonathan Raiman, et al., “Deep voice: Real-time neural text-to-
speech,” in Proceedings of the 34th International Conference on
Machine Learning-Volume 70. JMLR. org, 2017, pp. 195–204.
[90] Berrak Sisman, Machine Learning for Limited Data Voice Conversion,
Ph.D. thesis, 2019.
[91] Kuan Chen, Bo Chen, Jiahao Lai, and Kai Yu, “High-quality voice
conversion using spectrogram-based wavenet vocoder., in Inter-
speech, 2018, pp. 1993–1997.
[92] Nagaraj Adiga, Vassilis Tsiaras, and Yannis Stylianou, “On the use
of wavenet as a statistical vocoder, in 2018 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2018, pp. 5674–5678.
[93] Yi Zhao, Shinji Takaki, Hieu-Thi Luong, Junichi Yamagishi, Daisuke
Saito, and Nobuaki Minematsu, “Wasserstein gan and waveform
loss-based acoustic model training for multi-speaker text-to-speech
synthesis systems using a wavenet vocoder, IEEE Access, vol. 6, pp.
60478–60488, 2018.
[94] Xiaohai Tian, Eng Siong Chng, and Haizhou Li, A speaker-
dependent wavenet for voice conversion with non-parallel data,”
Proceedings of the Interspeech, Graz, Austria, pp. 15–19, 2019.
[95] Hui Lu, Zhiyong Wu, Runnan Li, Shiyin Kang, Jia Jia, and Helen
Meng, A compact framework for voice conversion using wavenet
conditioned on phonetic posteriorgrams,” in ICASSP 2019-2019 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2019, pp. 6810–6814.
[96] Hongqiang Du, Xiaohai Tian, Lei Xie, and Haizhou Li, “Wavenet fac-
torization with singular value decomposition for voice conversion,”
in 2019 IEEE Automatic Speech Recognition and Understanding
Workshop (ASRU). IEEE, 2019, pp. 152–159.
[97] Songxiang Liu, Yuewen Cao, Xixin Wu, Lifa Sun, Xunying Liu,
and Helen Meng, “Jointly trained conversion model and wavenet
vocoder for non-parallel voice conversion using mel-spectrograms
and phonetic posteriorgrams,” Proc. Interspeech 2019, pp. 714–718,
2019.
[98] Jan Chorowski, Ron Weiss, Samy Bengio, and Aaron Oord, “Unsuper-
vised speech representation learning using wavenet autoencoders,”
IEEE/ACM Transactions on Audio, Speech, and Language Processing,
vol. PP, pp. 1–1, 09 2019.
[99] Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas
Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, and
Vatsal Aggarwal, “Towards achieving robust universal neural vocod-
ing,” in Proc. Interspeech, 2019, vol. 2019, pp. 181–185.
[100] Prachi Govalkar, Johannes Fischer, Frank Zalkow, and Christian
Dittmar, A comparison of recent neural vocoders for speech signal
reconstruction,” in Proc. 10th ISCA Speech Synthesis Workshop, 2019,
pp. 7–12.
[101] Yuan-Hao Yi, Yang Ai, Zhen-Hua Ling, and Li-Rong Dai, “Singing
voice synthesis using deep autoregressive neural networks for acous-
tic modeling,” arXiv preprint arXiv:1906.08977, 2019.
[102] Takuma Okamoto, Tomoki Toda, Yoshinori Shiga, and Hisashi Kawai,
“Real-time neural text-to-speech with sequence-to-sequence acous-
tic model and waveglow or single gaussian wavernn vocoders,” in
Proc. Interspeech, 2019, vol. 2019, pp. 1308–1312.
[103] Soumi Maiti and Michael I Mandel, “Parametric resynthesis with
neural vocoders,” in 2019 IEEE Workshop on Applications of Signal
Processing to Audio and Acoustics (WASPAA). IEEE, 2019, pp. 303–
307.
[104] Xin Wang, Shinji Takaki, and Junichi Yamagishi, “Neural source-
filter-based waveform model for statistical parametric speech syn-
thesis,” in ICASSP 2019-2019 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp.
5916–5920.
[105] Xin Wang and Junichi Yamagishi, “Neural harmonic-plus-noise
waveform model with trainable maximum voice frequency for text-
to-speech synthesis,” arXiv preprint arXiv:1908.10256, 2019.
[106] Xin Wang, Shinji Takaki, and Junichi Yamagishi, “Neural source-
filter waveform models for statistical parametric speech synthesis,”
IEEE/ACM Transactions on Audio, Speech, and Language Processing,
vol. 28, pp. 402–415, 2019.
[107] Xiaohai Tian, Siu Wa Lee, Zhizheng Wu, Eng Siong Chng, and
Haizhou Li, “An Exemplar-based Approach to Fre-
quency Warping for Voice Conversion,” pp. 1–10, 2016.
[108] Hisao Kuwabara and Yoshinori Sagisak, “Acoustic characteristics of
speaker individuality: Control and conversion,” Speech communica-
tion, vol. 16, no. 2, pp. 165–173, 1995.
[109] Yannis Stylianou, Olivier Cappé, and Eric Moulines, “Continuous
probabilistic transform for voice conversion,” IEEE Transactions on
speech and audio processing, vol. 6, no. 2, pp. 131–142, 1998.
[110] Hiroshi Matsumoto and Yasuki Yamashita, “Unsupervised speaker
adaptation from short utterances based on a minimized fuzzy
objective function,” Journal of the Acoustical Society of Japan (E),
vol. 14, no. 5, pp. 353–361, 1993.
[111] Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano, “Voice con-
version algorithm based on gaussian mixture model with dynamic
frequency warping of straight spectrum,” in 2001 IEEE International
Conference on Acoustics, Speech, and Signal Processing. Proceedings
(Cat. No. 01CH37221). IEEE, 2001, vol. 2, pp. 841–844.
[112] Tomoki Toda, Jinlin Lu, Satoshi Nakamura, and Kiyohiro Shikano,
“Voice conversion algorithm based on gaussian mixture model
applied to straight,” 2000.
[113] Tomoki Toda, Alan W Black, and Keiichi Tokuda, “Spectral conver-
sion based on maximum likelihood estimation considering global
variance of converted parameter, in Proceedings.(ICASSP’05). IEEE
International Conference on Acoustics, Speech, and Signal Processing,
2005. IEEE, 2005, vol. 1, pp. I–9.
[114] Todd K Moon, “The expectation-maximization algorithm,” IEEE
Signal processing magazine, vol. 13, no. 6, pp. 47–60, 1996.
[115] Chuong B Do and Serafim Batzoglou, “What is the expectation
maximization algorithm?,” Nature biotechnology, vol. 26, no. 8, pp.
897–899, 2008.
[116] Guorong Xuan, Wei Zhang, and Peiqi Chai, “Em algorithms of gaus-
sian mixture model and hidden markov model, in Proceedings 2001
International Conference on Image Processing (Cat. No. 01CH37205).
IEEE, 2001, vol. 1, pp. 145–148.
[117] Maya R Gupta, Yihua Chen, et al., “Theory and use of the em
algorithm,” Foundations and Trends® in Signal Processing, vol. 4,
no. 3, pp. 223–296, 2011.
[118] Shinnosuke Takamichi, Tomoki Toda, Alan W Black, and Satoshi
Nakamura, “Modulation spectrum-based post-filter for gmm-based
voice conversion,” in Signal and Information Processing Association
Annual Summit and Conference (APSIPA), 2014 Asia-Pacific. IEEE,
2014, pp. 1–4.
[119] Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro
Shikano, “Maximum likelihood voice conversion based on gmm
with straight mixed excitation,” 2006.
[120] Hiromichi Kawanami, Yohei Iwami, Tomoki Toda, Hiroshi
Saruwatari, and Kiyohiro Shikano, “Gmm-based voice conversion
applied to emotional speech synthesis,” in Eighth European
Conference on Speech Communication and Technology, 2003.
[121] Ryo Aihara, Ryoichi Takashima, Tetsuya Takiguchi, and Yasuo
Ariki, “Gmm-based emotional voice conversion using spectrum and
prosody features,” American Journal of Signal Processing, vol. 2, no.
5, pp. 134–138, 2012.
[122] Hsin-Te Hwang, Yu Tsao, Hsin-Min Wang, Yih-Ru Wang, and Sin-
Horng Chen, “Incorporating global variance in the training phase
of gmm-based voice conversion,” in 2013 Asia-Pacific Signal and
Information Processing Association Annual Summit and Conference.
IEEE, 2013, pp. 1–6.
[123] Tudor-Cătălin Zorilă, Daniel Erro, and Inma Hernáez, “Improving
the quality of standard gmm-based voice conversion systems by
considering physically motivated linear transformations,” in Ad-
vances in Speech and Language Technologies for Iberian Languages,
pp. 30–39. Springer, 2012.
[124] Mostafa Ghorbandoost, Abolghasem Sayadiyan, Mohsen Ahangar,
Hamid Sheikhzadeh, Abdoreza Sabzi Shahrebabaki, and Jamal
Amini, “Voice conversion based on feature combination with limited
training data,” Speech Communication, vol. 67, pp. 113–128, 2015.
[125] Manuel Sam Ribeiro, Junichi Yamagishi, and Robert AJ Clark, “A
perceptual investigation of wavelet-based decomposition of f0 for
text-to-speech synthesis,” in Sixteenth Annual Conference of the
International Speech Communication Association, 2015.
[126] Manuel Sam Ribeiro, Oliver Watts, Junichi Yamagishi, and Robert AJ
Clark, “Wavelet-based decomposition of f0 as a secondary task for
dnn-based speech synthesis with multi-task learning,” in 2016 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2016, pp. 5525–5529.
[127] Cheng-Cheng Wang, Zhen-Hua Ling, Bu-Fan Zhang, and Li-Rong
Dai, “Multi-layer f0 modeling for hmm-based speech synthesis,”
in 2008 6th International Symposium on Chinese Spoken Language
Processing. IEEE, 2008, pp. 1–4.
[128] Gerard Sanchez, Hanna Silen, Jani Nurminen, and Moncef Gabbouj,
“Hierarchical modeling of F0 contours for voice conversion, In
Proceedings of the Annual Conference of the International Speech
Communication Association, INTERSPEECH, pp. 2318–2321, 2014.
[129] Daniel Erro, Asunción Moreno, and Antonio Bonafonte, “Voice con-
version based on weighted frequency warping,” IEEE Transactions
on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 922–
931, 2009.
[130] David Sundermann and Hermann Ney, “Vtln-based voice con-
version,” in Proceedings of the 3rd IEEE International Symposium
on Signal Processing and Information Technology (IEEE Cat. No.
03EX795). IEEE, 2003, pp. 556–559.
[131] Matthias Eichner, Matthias Wolff, and Rüdiger Hoffmann, “Voice
characteristics conversion for tts using reverse vtln,” in 2004 IEEE
International Conference on Acoustics, Speech, and Signal Processing.
IEEE, 2004, vol. 1, pp. I–17.
[132] Anna Přibilová and Jiří Přibil, “Non-linear frequency scale mapping
for voice conversion in text-to-speech system with cepstral descrip-
tion,” Speech Communication, vol. 48, no. 12, pp. 1691–1703, 2006.
[133] Robert Vích and Martin Vondra, “Pitch synchronous transform
warping in voice conversion,” in Cognitive Behavioural Systems, pp.
280–289. Springer, 2012.
[134] Elizabeth Godoy, Olivier Rosec, and Thierry Chonavel, Voice con-
version using dynamic frequency warping with amplitude scaling,
for parallel or nonparallel corpora,” IEEE Transactions on Audio,
Speech, and Language Processing, vol. 20, no. 4, pp. 1313–1323, 2011.
[135] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factor-
ization,” Advances in neural information processing systems,
pp. 556–562, 2001.
[136] Syu-Siang Wang, Alan Chern, Yu Tsao, Jeih-Weih Hung, Xugang Lu,
Ying-Hui Lai, and Borching Su, “Wavelet speech enhancement based
on nonnegative matrix factorization,” IEEE Signal Processing Letters,
vol. 23, 2016.
[137] Nasser Mohammadiha, Paris Smaragdis, and Arne Leijon, “Super-
vised and Unsupervised Speech Enhancement Using Nonnegative
Matrix Factorization,” IEEE Transactions on Audio, Speech and
Language Processing, vol. 21, no. 10, pp. 2140–2151, 2013.
[138] K A Akarsh, “Speech Enhancement using Non negative Matrix
Factorization and Enhanced NMF,” International Conference on
Circuit, Power and Computing Technologies (ICCPCT), 2015.
[139] Kevin W Wilson, Bhiksha Raj, Paris Smaragdis, and Ajay Divakaran,
“Speech denoising using nonnegative matrix factorization with pri-
ors,” In IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), 2008.
[140] Meng Sun, Yinan Li, Jort F Gemmeke, and Xiongwei Zhang, “Speech
enhancement under low SNR conditions via noise estimation us-
ing sparse and low-rank NMF with Kullback-Leibler divergence,”
IEEE/ACM Transactions on Audio, Speech and Language Processing,
vol. 23, no. 7, pp. 1233–1242, 2015.
[141] Zhizheng Wu, Tuomas Virtanen, Tomi Kinnunen, Eng Siong Chng,
and Haizhou Li, “Examplar-Based Voice Conversion Using Non-
Negative Spectrogram Deconvolution, 8th ISCA Speech Synthesis
Workshop, 2013.
[142] Yi Chiao Wu, Hsin Te Hwang, Chin Cheng Hsu, Yu Tsao, and
Hsin Min Wang, “Locally linear embedding for exemplar-based
spectral conversion,” In Proceedings of the Annual Conference of the
International Speech Communication Association, INTERSPEECH,
pp. 1652–1656, 2016.
[143] Huaiping Ming, Dongyan Huang, Lei Xie, Shaofei Zhang, Minghui
Dong, and Haizhou Li, “Exemplar-based sparse representation of
timbre and prosody for voice conversion,” In IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP),
2016.
[144] Berrak Sisman, Haizhou Li, and Kay Chen Tan, “Transformation
of prosody in voice conversion,” in 2017 Asia-Pacific Signal and
Information Processing Association Annual Summit and Conference
(APSIPA ASC). IEEE, 2017, pp. 1537–1546.
[145] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-
Min Wang, “Dictionary update for nmf-based voice conversion using
an encoder-decoder network,” 10th International Symposium on
Chinese Spoken Language Processing (ISCSLP), vol. 22, no. 3, pp.
293–297, 2016.
[146] Hermann Ney, David Suendermann, Antonio Bonafonte, and Harald
Höge, “A first step towards text-independent voice conversion,”
in Eighth International Conference on Spoken Language Processing,
2004.
[147] Hui Ye and Steve J. Young, “Voice conversion for unknown speakers,”
in INTERSPEECH 2004 - ICSLP, 8th International Conference on
Spoken Language Processing, Jeju Island, Korea, October 4-8, 2004.
2004, ISCA.
[148] Hui Ye and Steve Young, “Quality-enhanced voice morphing using
maximum likelihood transformations,” IEEE Transactions on Audio,
Speech, and Language Processing, vol. 14, pp. 1301–1312, 2006.
[149] Alan W Black and Nick Campbell, “Optimising selection of units
from speech databases for concatenative synthesis.,” 1995.
[150] Kei Fujii, Jun Okawa, and Kaori Suigetsu, “High individuality voice
conversion based on concatenative speech synthesis,” International
Journal of Electrical, Computer, Energetic, Electronic and Communi-
cation Engineering, vol. 1, no. 11, pp. 1617–1622, 2007.
[151] Yoshinori Sagisaka, Nobuyoshi Kaiki, Naoto Iwahashi, and Katsuhiko
Mimura, “Atr µ-talk speech synthesis system, in Second Interna-
tional Conference on Spoken Language Processing, 1992.
[152] Daniel Erro, Ferran Diego, and Antonio Bonafonte, “Voice conver-
sion of non-aligned data using unit selection,” 2006.
[153] A. Mouchtaris, J. Van der Spiegel, and P. Mueller, “Nonparallel
training for voice conversion based on a parameter adaptation
approach,” IEEE Transactions on Audio, Speech, and Language
Processing, vol. 14, no. 3, pp. 952–963, 2006.
[154] Tomoki Toda, Yamato Ohtani, and Kiyohiro Shikano, “Eigenvoice
conversion based on gaussian mixture model,” in INTERSPEECH,
2006.
[155] Najim Dehak, Patrick J Kenny, Réda Dehak, Pierre Dumouchel, and
Pierre Ouellet, “Front-end factor analysis for speaker verification,
IEEE Transactions on Audio, Speech, and Language Processing, vol.
19, no. 4, pp. 788–798, 2010.
[156] Z. Wu, T. Kinnunen, E. S. Chng, and H. Li, “Mixture of factor ana-
lyzers using priors from non-parallel speech for voice conversion,”
IEEE Signal Processing Letters, vol. 19, no. 12, pp. 914–917, 2012.
[157] Yannis Stylianou, Olivier Cappé, and Eric Moulines, “Continuous
probabilistic transform for voice conversion,” IEEE Transactions on
Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, 1998.
[158] Yi Zhou, Xiaohai Tian, Haihua Xu, Rohan Kumar Das, and Haizhou
Li, “Cross-lingual voice conversion with bilingual phonetic pos-
teriorgrams and average modeling,” International Conference on
Acoustic, Speech and Signal Processing (ICASSP), 2019.
[159] Tomi Kinnunen, Lauri Juvela, Paavo Alku, and Junichi Yamagishi,
“Non-parallel voice conversion using i-vector plda: Towards unifying
speaker verification and transformation,” in 2017 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2017, pp. 5535–5539.
[160] Hiroyuki Miyoshi, Yuki Saito, Shinnosuke Takamichi, and Hiroshi
Saruwatari, “Voice conversion using sequence-to-sequence learning
of context posterior probabilities,” arXiv preprint arXiv:1704.02360,
2017.
[161] Seung-won Park, Doo-young Kim, and Myun-chul Joe, “Cotatron:
Transcription-guided speech encoder for any-to-many voice conver-
sion without parallel data,” ArXiv, vol. abs/2005.03295, 2020.
[162] Feng-Long Xie, Yao Qian, Frank K Soong, and Haifeng Li, “Pitch
transformation in neural network based voice conversion,” in
The 9th International Symposium on Chinese Spoken Language
Processing. IEEE, 2014, pp. 197–200.
[163] Toru Nakashika, Ryoichi Takashima, Tetsuya Takiguchi, and Yasuo
Ariki, “Voice conversion in high-order eigen space using deep belief
nets.,” in Interspeech, 2013, pp. 369–372.
[164] Seyed Hamidreza Mohammadi and Alexander Kain, Voice con-
version using deep neural networks with speaker-independent pre-
training,” in 2014 IEEE Spoken Language Technology Workshop (SLT).
IEEE, 2014, pp. 19–23.
[165] Feng-Long Xie, Yao Qian, Yuchen Fan, Frank K Soong, and Haifeng
Li, “Sequence error (se) minimization training of neural network
for voice conversion,” in Fifteenth Annual Conference of the Inter-
national Speech Communication Association, 2014.
[166] Keiichi Tokuda, Takayoshi Yoshimura, Takashi Masuko, Takao
Kobayashi, and Tadashi Kitamura, “Speech parameter generation
algorithms for hmm-based speech synthesis,” in 2000 IEEE In-
ternational Conference on Acoustics, Speech, and Signal Processing.
Proceedings (Cat. No. 00CH37100). IEEE, 2000, vol. 3, pp. 1315–1318.
[167] Ling-hui Chen, Zhen-hua Ling, Li-juan Liu, and Li-rong Dai, “Voice
Conversion Using Deep Neural Networks With Layer-Wise Genera-
tive Training, IEEE Transactions on Audio, Speech and Language
Processing, vol. 22, no. 12, pp. 1859–1872, 2014.
[168] Toru Nakashika, Tetsuya Takiguchi, and Yasuo Ariki, “High-order
sequence modeling using speaker-dependent recurrent temporal
restricted Boltzmann machines for voice conversion,” In Proceedings
of the Annual Conference of the International Speech Communication
Association, INTERSPEECH, pp. 2278–2282, 2014.
[169] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term mem-
ory, Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[170] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins, “Learning
to forget: Continual prediction with lstm,” 1999.
[171] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink,
and Jürgen Schmidhuber, “Lstm: A search space odyssey,” IEEE
transactions on neural networks and learning systems, vol. 28, no.
10, pp. 2222–2232, 2016.
[172] Huaiping Ming, Dongyan Huang, Lei Xie, Jie Wu, Minghui Dong,
and Haizhou Li, “Deep bidirectional LSTM modeling of timbre
and prosody for emotional voice conversion,” In Proceedings of
the Annual Conference of the International Speech Communication
Association, INTERSPEECH, pp. 2453–2457, 2016.
[173] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural
machine translation by jointly learning to align and translate, arXiv
preprint arXiv:1409.0473, 2014.
[174] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Atten-
tion is all you need,” in Advances in neural information processing
systems, 2017, pp. 5998–6008.
[175] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and
spell: A neural network for large vocabulary conversational speech
recognition,” in 2016 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), 2016, pp. 4960–4964.
[176] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J
Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen,
Samy Bengio, et al., Tacotron: Towards end-to-end speech syn-
thesis,” arXiv preprint arXiv:1703.10135, 2017.
[177] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O Arik, Ajay
Kannan, Sharan Narang, Jonathan Raiman, and John Miller, “Deep
voice 3: 2000-speaker neural text-to-speech,” arXiv preprint
arXiv:1710.07654, 2017.
[178] Hideyuki Tachibana, Katsuya Uenoyama, and Shunsuke Aihara,
“Efficiently trainable text-to-speech system based on deep convolu-
tional networks with guided attention,” in 2018 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2018, pp. 4784–4788.
[179] Jing-Xuan Zhang, Zhen-Hua Ling, Li-Juan Liu, Yuan Jiang, and
Li-Rong Dai, “Sequence-to-sequence acoustic modeling for voice
conversion,” IEEE/ACM Transactions on Audio, Speech, and Language
Processing, vol. 27, no. 3, pp. 631–644, 2019.
[180] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry
Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio,
“Learning phrase representations using rnn encoder-decoder for
statistical machine translation,” arXiv preprint arXiv:1406.1078,
2014.
[181] K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, “Atts2s-vc:
Sequence-to-sequence voice conversion with attention and context
preservation mechanisms,” in ICASSP 2019 - 2019 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP),
2019, pp. 6805–6809.
[182] Minh-Thang Luong, Hieu Pham, and Christopher D Manning, “Ef-
fective approaches to attention-based neural machine translation,”
arXiv preprint arXiv:1508.04025, 2015.
[183] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann
Dauphin, “Convolutional sequence to sequence learning,” ArXiv,
vol. abs/1705.03122, 2017.
[184] Hirokazu Kameoka, Kou Tanaka, Takuhiro Kaneko, and Nobukatsu
Hojo, “Convs2s-vc: Fully convolutional sequence-to-sequence voice
conversion,” ArXiv, vol. abs/1811.01609, 2018.
[185] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros, “Un-
paired image-to-image translation using cycle-consistent adversarial
networks,” in Proceedings of the IEEE international conference on
computer vision, 2017, pp. 2223–2232.
[186] Kenan E Ak, Joo Hwee Lim, Jo Yew Tham, and Ashraf A Kassim,
Attribute manipulation generative adversarial networks for fashion
images,” in Proceedings of the IEEE International Conference on
Computer Vision, 2019, pp. 10541–10550.
[187] Kenan E Ak, Ashraf A Kassim, Joo Hwee Lim, and Jo Yew Tham,
“Learning attribute representations with localization for flexible
fashion search,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2018, pp. 7708–7717.
[188] Kenan Emir Ak, Deep learning approaches for attribute manipulation
and text-to-image synthesis, Ph.D. thesis, 2019.
[189] Kenan E Ak, Joo Hwee Lim, Jo Yew Tham, and Ashraf A Kassim,
“Efficient multi-attribute similarity learning towards attribute-based
fashion search,” in 2018 IEEE Winter Conference on Applications of
Computer Vision (WACV). IEEE, 2018, pp. 1671–1679.
[190] Kenan E Ak, Ning Xu, Zhe Lin, and Yilin Wang, “Incorporating
reinforced adversarial learning in autoregressive image generation,
arXiv preprint arXiv:2007.09923, 2020.
[191] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio,
“Generative adversarial nets,” in Advances in neural information
processing systems, 2014, pp. 2672–2680.
[192] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz, “Mul-
timodal unsupervised image-to-image translation,” in Proceedings
of the European Conference on Computer Vision (ECCV), 2018, pp.
172–189.
[193] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A
Efros, Oliver Wang, and Eli Shechtman, “Toward multimodal image-
to-image translation,” in Advances in neural information processing
systems, 2017, pp. 465–476.
[194] Kenan E Ak, Joo Hwee Lim, Jo Yew Tham, and Ashraf A Kas-
sim, “Semantically consistent text to fashion image synthesis with
an enhanced attentional generative adversarial network,” Pattern
Recognition Letters, 2020.
[195] Kenan Emir Ak, Joo Hwee Lim, Jo Yew Tham, and Ashraf Kassim,
“Semantically consistent hierarchical text to fashion image synthesis
with an enhanced-attentional generative adversarial network,” in
Proceedings of the IEEE International Conference on Computer Vision
Workshops, 2019, pp. 0–0.
[196] Zhong Meng, Jinyu Li, Yifan Gong, and Biing-Hwang (Fred) Juang,
“Cycle-Consistent Speech Enhancement, INTERSPEECH, 2018.
[197] Masato Mimura, Shinsuke Sakai, and Tatsuya Kawahara, “Cross-
domain speech recognition using nonparallel corpora with cycle-
consistent adversarial networks,” IEEE Automatic Speech Recognition
and Understanding Workshop (ASRU), 2017.
[198] Dongsuk Yook, In-Chul Yoo, and Seungho Yoo, “Voice conversion
using conditional cyclegan,” in 2018 International Conference on
Computational Science and Computational Intelligence (CSCI). IEEE,
2018, pp. 1460–1461.
[199] Sicong Huang, Qiyang Li, Cem Anil, Xuchan Bao, Sageev Oore,
and Roger B Grosse, “Timbretron: A wavenet (cyclegan (cqt
(audio))) pipeline for musical timbre transfer, arXiv preprint
arXiv:1811.09620, 2018.
[200] Takuhiro Kaneko and Hirokazu Kameoka, “Cyclegan-vc: Non-parallel
voice conversion using cycle-consistent adversarial networks,” in
2018 26th European Signal Processing Conference (EUSIPCO). IEEE,
2018, pp. 2100–2104.
[201] Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu
Hojo, “Cyclegan-vc2: Improved cyclegan-based non-parallel voice
conversion,” in ICASSP 2019-2019 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp.
6820–6824.
[202] Yaniv Taigman, Adam Polyak, and Lior Wolf, “Unsupervised cross-
domain image generation,” ArXiv abs/1611.02200, 2016.
[203] Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro
Kobayashi, and Tomoki Toda, Voice conversion with cyclic recur-
rent neural network and fine-tuned wavenet vocoder,” in ICASSP
2019-2019 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP). IEEE, 2019, pp. 6815–6819.
[204] Berrak Sisman, Mingyang Zhang, Minghui Dong, and Haizhou Li,
“On the study of generative adversarial networks for cross-lingual
voice conversion,” in 2019 IEEE Automatic Speech Recognition and
Understanding Workshop (ASRU). IEEE, 2019, pp. 144–151.
[205] Kun Zhou, Berrak Sisman, and Haizhou Li, “Transforming spec-
trum and prosody for emotional voice conversion with non-parallel
training data,” arXiv preprint arXiv:2002.00198, 2020.
[206] Kun Zhou, Berrak Sisman, Mingyang Zhang, and Haizhou Li, “Con-
verting anyone’s emotion: Towards speaker-independent emotional
voice conversion,” arXiv preprint arXiv:2005.07025, 2020.
[207] Cheng-chieh Yeh, Po-chun Hsu, Ju-chieh Chou, Hung-yi Lee, and
Lin-shan Lee, “Rhythm-flexible voice conversion without parallel
data using cycle-gan over phoneme posteriorgram sequences, in
2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018,
pp. 274–281.
[208] Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, and Haizhou
Li, “Wavetts: Tacotron-based tts with joint time-frequency domain
loss,” arXiv preprint arXiv:2002.00417, 2020.
[209] Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu
Kameoka, and Tomoki Toda, “Voice transformer network: Sequence-
to-sequence voice conversion using transformer with text-to-speech
pretraining,” arXiv preprint arXiv:1912.06813, 2019.
[210] Jing-Xuan Zhang, Zhen-Hua Ling, Yuan Jiang, Li-Juan Liu, Chen
Liang, and Li-Rong Dai, “Improving sequence-to-sequence voice
conversion by adding text-supervision,” in ICASSP 2019-2019 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2019, pp. 6785–6789.
[211] Hieu-Thi Luong and Junichi Yamagishi, “Bootstrapping non-parallel
voice conversion from speaker-adaptive text-to-speech,” in 2019
IEEE Automatic Speech Recognition and Understanding Workshop
(ASRU). IEEE, 2019, pp. 200–207.
[212] Hieu-Thi Luong and Junichi Yamagishi, “Nautilus: a versatile voice
cloning system,” arXiv preprint arXiv:2005.11004, 2020.
[213] Fadi Biadsy, Ron J Weiss, Pedro J Moreno, Dimitri Kanvesky, and
Ye Jia, “Parrotron: An end-to-end speech-to-speech conversion
model and its applications to hearing-impaired speech and speech
separation,” arXiv preprint arXiv:1904.04169, 2019.
[214] Songxiang Liu, Yuewen Cao, and Helen Meng, “Multi-target emo-
tional voice conversion with neural vocoders,” arXiv preprint
arXiv:2004.03782, 2020.
[215] Mingyang Zhang, Berrak Sisman, Sai Sirisha Rallabandi, Haizhou
Li, and Li Zhao, “Error reduction network for dblstm-based voice
conversion,” in 2018 Asia-Pacific Signal and Information Processing
Association Annual Summit and Conference (APSIPA ASC). IEEE,
2018, pp. 823–828.
[216] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo
Larochelle, and Ole Winther, “Autoencoding beyond pixels using
a learned similarity metric,” Proceedings of The 33rd International
Conference on Machine Learning, PMLR, 2016.
[217] Ju-Chieh Chou, Cheng chieh Yeh, and Hung yi Lee, “One-shot voice
conversion by separating speaker and content representations with
instance normalization,” ArXiv, vol. abs/1904.05742, 2019.
[218] Da-Yi Wu and Hung-yi Lee, “One-shot voice conversion by vector
quantization,” in ICASSP 2020-2020 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp.
7734–7738.
[219] Da-Yi Wu, Yen-Hao Chen, and Hung-Yi Lee, “Vqvc+: One-shot voice
conversion by vector quantization and u-net architecture, arXiv
preprint arXiv:2006.04154, 2020.
[220] Diederik P Kingma and Max Welling, “Auto-encoding variational
bayes,” arXiv preprint arXiv:1312.6114, 2013.
[221] Shaojin Ding and Ricardo Gutierrez-Osuna, “Group latent embed-
ding for vector quantized variational autoencoder in non-parallel
voice conversion.,” in INTERSPEECH, 2019, pp. 724–728.
[222] Wen-Chin Huang, Hsin-Te Hwang, Yu-Huai Peng, Yu Tsao, and Hsin-
Min Wang, “Voice conversion based on cross-domain features using
variational auto encoders,” in 2018 11th International Symposium
on Chinese Spoken Language Processing (ISCSLP). IEEE, 2018, pp.
51–55.
[223] Yanping Li, Kong Aik Lee, Yougen Yuan, Haizhou Li, and Zhen Yang,
“Many-to-many voice conversion based on bottleneck features with
variational autoencoder for non-parallel training data,” in 2018
Asia-Pacific Signal and Information Processing Association Annual
Summit and Conference (APSIPA ASC). IEEE, 2018, pp. 829–833.
[224] Yuki Saito, Yusuke Ijima, Kyosuke Nishida, and Shinnosuke
Takamichi, “Non-parallel voice conversion using variational autoen-
coders conditioned by phonetic posteriorgrams and d-vectors,” in
2018 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2018, pp. 5274–5278.
[225] Wen-Chin Huang, Hao Luo, Hsin-Te Hwang, Chen-Chou Lo, Yu-
Huai Peng, Yu Tsao, and Hsin-Min Wang, “Unsupervised represen-
tation disentanglement using cross domain features and adversarial
learning in variational autoencoder based voice conversion,” IEEE
Transactions on Emerging Topics in Computational Intelligence, p.
1–12, 2020.
[226] Songxiang Liu, Yuewen Cao, Shiyin Kang, Na Hu, Xunying Liu, Dan
Su, Dong Yu, and Helen Meng, “Transferring source style in non-
parallel voice conversion,” arXiv preprint arXiv:2005.09178, 2020.
[227] R. Kubichek, “Mel-cepstral distance measure for objective speech
quality assessment,” Communications, Computers and Signal Pro-
cessing, pp. 125–128, 1993.
[228] Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen,
“Pearson correlation coefficient, in Noise reduction in speech
processing, pp. 1–4. Springer, 2009.
[229] Tianfeng Chai and Roland R Draxler, “Root mean square error (rmse)
or mean absolute error (mae)?–arguments against avoiding rmse in
the literature,” Geoscientific model development, vol. 7, no. 3, pp.
1247–1250, 2014.
[230] Cort J Willmott and Kenji Matsuura, Advantages of the mean
absolute error (mae) over the root mean square error (rmse) in
assessing average model performance,” Climate research, vol. 30,
no. 1, pp. 79–82, 2005.
[231] Volodya Grancharov and W Bastiaan Kleijn, “Speech quality as-
sessment,” in Springer handbook of speech processing, pp. 83–100.
Springer, 2008.
[232] Robert C Streijl, Stefan Winkler, and David S Hands, “Mean opinion
score (mos) revisited: methods and applications, limitations and
alternatives,” Multimedia Systems, vol. 22, no. 2, pp. 213–227, 2016.
[233] Min Chu, Hu Peng, and Yong Zhao, “Optimization of an objective
measure for estimating mean opinion score of synthesized speech,”
June 10 2008, US Patent 7,386,451.
[234] Mahesh Viswanathan and Madhubalan Viswanathan, “Measuring
speech quality for text-to-speech systems: development and assess-
ment of a modified mean opinion score (mos) scale,” Computer
Speech & Language, vol. 19, no. 1, pp. 55–83, 2005.
[235] Alexander Kain and Michael W Macon, “Design and evaluation of
a voice conversion algorithm based on spectral envelope mapping
and residual prediction,” in 2001 IEEE International Conference
on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.
01CH37221). IEEE, 2001, vol. 2, pp. 813–816.
[236] Terry N. Flynn and Anthony A. J. Marley, “Best worst scaling:
Theory and methods,” Handbook of choice modelling, Edward Elgar
Publishing, pp. 178–201, 2014.
[237] Tomoki Toda, Ling-Hui Chen, Daisuke Saito, Fernando Villavicencio,
Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi, “The Voice
Conversion Challenge 2016,” In INTERSPEECH, pp. 1632–1636, 2016.
[238] Mingyang Zhang, Berrak Sisman, Li Zhao, and Haizhou Li, “Deep-
conversion: Voice conversion with limited parallel training data,”
Speech Communication, 2020.
[239] Jiahao Lai, Bo Chen, Tian Tan, Sibo Tong, and Kai Yu, “Phone-aware
lstm-rnn for voice conversion,” in 2016 IEEE 13th International
Conference on Signal Processing (ICSP). IEEE, 2016, pp. 177–182.
[240] Alan W Black, H Timothy Bunnell, Ying Dou, Prasanna Kumar
Muthukumar, Florian Metze, Daniel Perry, Tim Polzehl, Kishore
Prahallad, Stefan Steidl, and Callie Vaughn, “Articulatory features for
expressive speech synthesis,” in 2012 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp.
4005–4008.
[241] Beth Logan et al., “Mel frequency cepstral coefficients for music
modeling.,” in Ismir, 2000, vol. 270, pp. 1–11.
[242] Chitralekha Gupta, Haizhou Li, and Ye Wang, “Perceptual evaluation
of singing quality, in 2017 Asia-Pacific Signal and Information
Processing Association Annual Summit and Conference (APSIPA ASC),
2017, pp. 577–586.
[243] Wei Chu and Abeer Alwan, “Reducing f0 frame error of f0 tracking
algorithms under noisy conditions with an unvoiced/voiced classifi-
cation frontend,” in 2009 IEEE International Conference on Acoustics,
Speech and Signal Processing. IEEE, 2009, pp. 3969–3972.
[244] Tomohiro Nakatani, Shigeaki Amano, Toshio Irino, Kentaro Ishizuka,
and Tadahisa Kondo, “A method for fundamental frequency estima-
tion and voicing decision: Application to infant utterances recorded
in real acoustical environments,” Speech Communication, vol. 50,
no. 3, pp. 203–214, 2008.
[245] RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy
Stanton, Joel Shor, Ron J Weiss, Rob Clark, and Rif A Saurous, To-
wards end-to-end prosody transfer for expressive speech synthesis
with tacotron,” arXiv preprint arXiv:1803.09047, 2018.
[246] Berrak Sisman, Grandee Lee, Haizhou Li, and Kay Chen Tan, “On
the analysis and evaluation of prosody conversion techniques,” in
2017 International Conference on Asian Language Processing (IALP).
IEEE, 2017, pp. 44–47.
[247] Tomomi Watanabe, Takahiro Murakami, Munehiro Namba, Tetsuya
Hoya, and Yoshihisa Ishida, “Transformation of spectral envelope
for voice conversion based on radial basis function networks,” in
Seventh international conference on spoken language processing,
2002.
[248] Kazuhiro Kobayashi, Shinnosuke Takamichi, Satoshi Nakamura, and
Tomoki Toda, “The nu-naist voice conversion system for the voice
conversion challenge 2016.,” in Interspeech, 2016, pp. 1667–1671.
[249] B Ramani, MP Actlin Jeeva, P Vijayalakshmi, and T Nagarajan,
“Cross-lingual voice conversion-based polyglot speech synthesizer
for indian languages,” in Fifteenth annual conference of the inter-
national speech communication association, 2014.
[250] Oytun Turk and Levent M Arslan, “Robust processing techniques
for voice conversion,” Computer Speech & Language, vol. 20, no. 4,
pp. 441–467, 2006.
[251] Srinivas Desai, Alan W Black, B Yegnanarayana, and Kishore Prahal-
lad, “Spectral mapping using artificial neural networks for voice
conversion,” IEEE Transactions on Audio, Speech, and Language
Processing, vol. 18, no. 5, pp. 954–964, 2010.
[252] Masatsune Tamura, Takashi Masuko, Keiichi Tokuda, and Takao
Kobayashi, “Speaker adaptation for hmm-based speech synthesis
system using mllr, in the third ESCA/COCOSDA Workshop (ETRW)
on Speech Synthesis, 1998.
[253] Volodya Grancharov, David Yuheng Zhao, Jonas Lindblom, and
W Bastiaan Kleijn, “Low-complexity, nonintrusive speech quality
assessment,” IEEE Transactions on Audio, Speech, and Language
Processing, vol. 14, no. 6, pp. 1948–1956, 2006.
[254] Mirjam Wester, Cassia Valentini-Botinhao, and Gustav Eje Henter,
Are we using enough listeners? no!—an empirically-supported cri-
tique of interspeech 2014 tts evaluations,” in Sixteenth Annual
Conference of the International Speech Communication Association,
2015.
[255] Slawomir Zielinski, Philip Hardisty, Christopher Hummersone, and
Francis Rumsey, “Potential biases in mushra listening tests,” in Au-
dio Engineering Society Convention 123. Audio Engineering Society,
2007.
[256] Hadas Benisty and David Malah, Voice conversion using gmm with
enhanced global variance,” in Twelfth Annual Conference of the
International Speech Communication Association, 2011.
[257] Jakub Vít, Zdeněk Hanzlíček, and Jindřich Matoušek, “On the
analysis of training data for wavenet-based speech synthesis,” in
2018 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2018, pp. 5684–5688.
[258] Meng Zhang, Jianhua Tao, Jilei Tian, and Xia Wang, “Text-
independent voice conversion based on state mapped codebook,” in
2008 IEEE International Conference on Acoustics, Speech and Signal
Processing. IEEE, 2008, pp. 4605–4608.
[259] ITU-R Recommendation BS.1534-1, “Method for the subjective as-
sessment of intermediate sound quality (MUSHRA),” International
Telecommunications Union, Geneva, Switzerland, 2001.
[260] Antony W Rix, John G Beerends, Michael P Hollier, and Andries P
Hekstra, “Perceptual evaluation of speech quality (pesq)-a new
method for speech quality assessment of telephone networks and
codecs,” in 2001 IEEE International Conference on Acoustics, Speech,
and Signal Processing. Proceedings (Cat. No. 01CH37221). IEEE, 2001,
vol. 2, pp. 749–752.
[261] Szu-Wei Fu, Yu Tsao, Hsin-Te Hwang, and Hsin-Min Wang, “Quality-
net: An end-to-end non-intrusive speech quality assessment model
based on blstm,” arXiv preprint arXiv:1808.05344, 2018.
[262] Takenori Yoshimura, Gustav Eje Henter, Oliver Watts, Mirjam Wester,
Junichi Yamagishi, and Keiichi Tokuda, “A hierarchical predictor of
synthetic speech naturalness using neural networks.,” in INTER-
SPEECH, 2016, pp. 342–346.
[263] Brian Patton, Yannis Agiomyrgiannakis, Michael Terry, Kevin Wilson,
Rif A Saurous, and D Sculley, “Automos: Learning a non-intrusive
assessor of naturalness-of-speech,” arXiv preprint arXiv:1611.09207,
2016.
[264] Milos Cernak and Milan Rusko, “An evaluation of synthetic speech
using the pesq measure,” in Proc. European Congress on Acoustics,
2005, pp. 2725–2728.
[265] Dong-Yan Huang, “Prediction of perceived sound quality of syn-
thetic speech,” Proc. APSIPA, 2011.
[266] Ulpu Remes, Reima Karhila, and Mikko Kurimo, “Objective evalu-
ation measures for speaker-adaptive hmm-tts systems, in Eighth
ISCA Workshop on Speech Synthesis, 2013.
[267] Chen-Chou Lo, Szu-Wei Fu, Wen-Chin Huang, Xin Wang, Junichi
Yamagishi, Yu Tsao, and Hsin-Min Wang, “Mosnet: Deep learning
based objective assessment for voice conversion,” arXiv preprint
arXiv:1904.08352, 2019.
[268] Jennifer Williams, Joanna Rownicka, Pilar Oplustil, and Simon King,
“Comparison of speech representations for automatic quality esti-
mation in multi-speaker text-to-speech synthesis,” arXiv preprint
arXiv:2002.12645, 2020.
[269] Tomoki Toda, Ling-Hui Chen, Daisuke Saito, Fernando Villavicencio,
Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi, “The voice
conversion challenge 2016,” in Interspeech 2016, 2016, pp. 1632–
1636.
[270] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke
Saito, Fernando Villavicencio, Tomi Kinnunen, and Zhenhua Ling,
“The voice conversion challenge 2018: Promoting development of
parallel and nonparallel methods,” in Proc. Odyssey 2018 The Speaker
and Language Recognition Workshop, 2018, pp. 195–202.
[271] Zhizheng Wu, Nicholas Evans, Tomi Kinnunen, Junichi Yamagishi,
Federico Alegre, and Haizhou Li, “Spoofing and countermeasures
for speaker verification: A survey,” Speech Communication, vol. 66,
pp. 130–153, 2015.
[272] Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi, Analysis of the
voice conversion challenge 2016 evaluation results,” in Interspeech
2016, 2016, pp. 1637–1641.
[273] Kazuhiro Kobayashi, Shinnosuke Takamichi, Satoshi Nakamura, and
Tomoki Toda, “The nu-naist voice conversion system for the voice
conversion challenge 2016,” in Interspeech 2016, 2016, pp. 1667–
1671.
[274] Yichiao Wu, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro
Kobayashi, and Tomoki Toda, “The nu non-parallel voice conversion
system for the voice conversion challenge 2018,” in Proc. Odyssey
2018 The Speaker and Language Recognition Workshop, 2018, pp.
211–218.
[275] Li-Juan Liu, Zhen-Hua Ling, Yuan Jiang, Ming Zhou, and Li-Rong
Dai, Wavenet vocoder with limited training data for voice conver-
sion,” in Proc. Interspeech 2018, 2018, pp. 1983–1987.
[276] J. Zhang, Z. Ling, L. Liu, Y. Jiang, and L. Dai, “Sequence-to-sequence
acoustic modeling for voice conversion,” IEEE/ACM Transactions on
Audio, Speech, and Language Processing, vol. 27, no. 3, pp. 631–644,
2019.
[277] J. Zhang, Z. Ling, and L. Dai, “Non-parallel sequence-to-sequence
voice conversion with disentangled linguistic and speaker represen-
tations,” IEEE/ACM Transactions on Audio, Speech, and Language
Processing, vol. 28, pp. 540–552, 2020.
[278] Zhizheng Wu, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi,
Cemal Hanilçi, Md. Sahidullah, and Aleksandr Sizov, “ASVspoof 2015:
the first automatic speaker verification spoofing and countermea-
sures challenge,” in Proc. Interspeech, 2015, pp. 2037–2041.
[279] Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilçi, M. Sahidullah, A. Sizov,
N. Evans, M. Todisco, and H. Delgado, “Asvspoof: The automatic
speaker verification spoofing and countermeasures challenge,” IEEE
Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 588–
604, 2017.
[280] Tomi Kinnunen, Md. Sahidullah, Héctor Delgado, Massimiliano
Todisco, Nicholas Evans, Junichi Yamagishi, and Kong-Aik Lee, “The
ASVspoof 2017 challenge: assessing the limits of replay spoofing
attack detection,” in Proc. Interspeech, 2017, pp. 2–6.
[281] Massimiliano Todisco, Xin Wang, Ville Vestman, Md. Sahidullah,
Héctor Delgado, Andreas Nautsch, Junichi Yamagishi, Nicholas
Evans, Tomi H. Kinnunen, and Kong Aik Lee, “ASVspoof 2019: future
horizons in spoofed and fake audio detection,” in Proc. Interspeech,
2019, pp. 1008–1012.
[282] Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Hector Delgado,
Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman,
Tomi Kinnunen, Kong Aik Lee, Lauri Juvela, Paavo Alku, Yu-Huai
Peng, Hsin-Te Hwang, Yu Tsao, Hsin-Min Wang, Sebastien Le Ma-
guer, Markus Becker, Fergus Henderson, Rob Clark, Yu Zhang, Quan
Wang, Ye Jia, Kai Onuma, Koji Mushika, Takashi Kaneda, Yuan Jiang,
Li-Juan Liu, Yi-Chiao Wu, Wen-Chin Huang, Tomoki Toda, Kou
Tanaka, Hirokazu Kameoka, Ingmar Steiner, Driss Matrouf, Jean-
Francois Bonastre, Avashna Govender, Srikanth Ronanki, Jing-Xuan
Zhang, and Zhen-Hua Ling, Asvspoof 2019: a large-scale public
database of synthetic, converted and replayed speech,” 2019.
[283] John Kominek and Alan W Black, “The cmu arctic speech databases,”
in Fifth ISCA workshop on speech synthesis, 2004.
[284] Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald, et al.,
“Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning
toolkit,” 2016.
[285] Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia,
Zhifeng Chen, and Yonghui Wu, “LibriTTS: A Corpus Derived from
LibriSpeech for Text-to-Speech,” in Proc. Interspeech 2019, 2019, pp.
1526–1530.
[286] Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman,
“Voxceleb: Large-scale speaker verification in the wild, Computer
Speech & Language, vol. 60, pp. 101027, 2020.
[287] Kazuhiro Kobayashi and Tomoki Toda, “sprocket: Open-source
voice conversion software,” in Proc. Odyssey 2018 The Speaker and
Language Recognition Workshop, 2018, pp. 203–210.
[288] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro
Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann,
Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsub-
asa Ochiai, “Espnet: End-to-end speech processing toolkit,” in Proc.
Interspeech 2018, 2018, pp. 2207–2211.
Berrak Sisman received her PhD degree in Elec-
trical and Computer Engineering from National
University of Singapore in 2020, fully funded by
A*STAR Graduate Academy under Singapore Inter-
national Graduate Award (SINGA). She is currently
an Assistant Professor at Singapore University of
Technology and Design (SUTD). She is also an
Affiliated Researcher at the National University of
Singapore (NUS). Prior to joining SUTD, she was
a Postdoctoral Research Fellow at the National
University of Singapore. She was also an exchange
PhD student at the University of Edinburgh and a visiting scholar at
The Centre for Speech Technology Research, University of Edinburgh in
2019. She was attached to the RIKEN Advanced Intelligence Project, Japan,
in 2018. Her research interests include speech information processing, ma-
chine learning, speech synthesis and voice conversion. She has published
in leading journals and conferences, including IEEE/ACM Transactions
on Audio, Speech and Language Processing, ASRU, INTERSPEECH and
ICASSP. She has served as the Local Arrangement Co-chair of IEEE ASRU
2019, Chair of Young Female Researchers Mentoring @ASRU2019, and
Chair of the INTERSPEECH Student Events in 2018 and 2019.
Junichi Yamagishi received the Ph.D. degree from
the Tokyo Institute of Technology (Tokyo Tech),
Tokyo, Japan, in 2006. He is currently a Professor
with the National Institute of Informatics, Tokyo,
Japan, and also a Senior Research Fellow with The
Centre for Speech Technology Research, The Uni-
versity of Edinburgh, Edinburgh, UK. Since 2006,
he has authored or co-authored over 250 refereed
papers in international journals and conferences.
Prof. Yamagishi was a recipient of the Tejima Prize
as the best Ph.D. thesis of Tokyo Tech in 2007. He
received the Itakura Prize from the Acoustic Society of Japan in 2010,
the Kiyasu Special Industrial Achievement Award from the Information
Processing Society of Japan in 2013, the Young Scientists’ Prize from the
Minister of Education, Science and Technology in 2014, the JSPS Prize
from the Japan Society for the Promotion of Science in 2016, and the
17th DOCOMO Mobile Science Award from the Mobile Communication
Fund, Japan in 2018. He was one of the organizers for special sessions
on Spoofing and Countermeasures for the Automatic Speaker Verification
at INTERSPEECH 2013, the 1st/2nd/3rd ASVspoof Evaluation, the Voice
Conversion Challenge 2016/2018/2020, and the VoicePrivacy Challenge
2020. He was an Associate Editor of the IEEE/ACM Transactions on Audio,
Speech, and Language Processing, a Lead Guest Editor of the IEEE Journal
of Selected Topics in Signal Processing Special Issue on Spoofing and
Countermeasures for Automatic Speaker Verification, and a member of
the IEEE Signal Processing Society Speech and Language Technical
Committee. He is now the Chairperson of the ISCA Special Interest Group:
Speech Synthesis (SynSIG), a member of the Technical Committee for the
Asia-Pacific Signal and Information Processing Association Multimedia
Security and Forensics, and a Senior Area Editor of the IEEE/ACM
Transactions on Audio, Speech, and Language Processing.
Simon King (M’95–SM’08–F’15) received the M.A.
(Cantab) and M.Phil. degrees from the University
of Cambridge, Cambridge, U.K., and the Ph.D.
degree from the University of Edinburgh, Edinburgh,
U.K. He has been with the Centre for Speech
Technology Research, University of Edinburgh,
since 1993, where he is now Professor of Speech
Processing and the Director of the Centre. His
research interests include speech synthesis, recog-
nition, and signal processing, and he has around
230 publications across these areas. He has served
on the ISCA SynSIG Board and currently co-organises the Blizzard Chal-
lenge. He has previously served on the IEEE SLTC and as an Associate
Editor of the IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE
PROCESSING, and is currently an Associate Editor of Computer Speech
and Language.
Haizhou Li (M’91-SM’01-F’14) received the B.Sc.,
M.Sc., and Ph.D. degrees in electrical and elec-
tronic engineering from South China University of
Technology, Guangzhou, China, in 1984, 1987, and
1990, respectively. Dr Li is currently a Professor at
the Department of Electrical and Computer Engi-
neering, National University of Singapore (NUS).
His research interests include automatic speech
recognition, speaker and language recognition,
and natural language processing. Prior to joining
NUS, he taught at the University of Hong Kong
(1988-1990) and South China University of Technology (1990-1994). He
was a Visiting Professor at CRIN in France (1994-1995), Research Manager
at the Apple-ISS Research Centre (1996-1998), Research Director in Lernout
& Hauspie Asia Pacific (1999-2001), Vice President in InfoTalk Corp. Ltd.
(2001-2003), and the Principal Scientist and Department Head of Human
Language Technology in the Institute for Infocomm Research, Singapore
(2003-2016). Dr Li served as the Editor-in-Chief of IEEE/ACM Transactions
on Audio, Speech and Language Processing (2015-2018), a Member of the
Editorial Board of Computer Speech and Language (2012-2018), an elected
Member of IEEE Speech and Language Processing Technical Committee
(2013-2015), the President of the International Speech Communication As-
sociation (2015-2017), the President of Asia Pacific Signal and Information
Processing Association (2015-2016), and the President of Asian Federation
of Natural Language Processing (2017-2018). He was the General Chair of
ACL 2012, INTERSPEECH 2014 and ASRU 2019. Dr Li is a Fellow of the IEEE
and the ISCA. He was a recipient of the National Infocomm Award 2002
and the President’s Technology Award 2013 in Singapore. He was named
one of the two Nokia Visiting Professors in 2009 by the Nokia Foundation,
and U Bremen Excellence Chair Professor in 2019.