An Overview of Voice Conversion and
its Challenges: From Statistical Modeling to
Deep Learning
Berrak Sisman, Member, IEEE, Junichi Yamagishi, Senior Member, IEEE, Simon King, Fellow, IEEE,
and Haizhou Li, Fellow, IEEE
Abstract—Speaker identity is one of the important charac-
teristics of human speech. In voice conversion, we change the
speaker identity from one to another, while keeping the lin-
guistic content unchanged. Voice conversion involves multiple
speech processing techniques, such as speech analysis, spectral
conversion, prosody conversion, speaker characterization, and
vocoding. With the recent advances in theory and practice, we
are now able to produce human-like voice quality with high
speaker similarity. In this paper, we provide a comprehensive
overview of the state-of-the-art of voice conversion techniques
and their performance evaluation methods from the statistical
approaches to deep learning, and discuss their promise and
limitations. We will also report the recent Voice Conversion
Challenges (VCC), the performance of the current state of
technology, and provide a summary of the available resources
for voice conversion research.
Index Terms—Voice conversion, speech analysis, speaker
characterization, vocoding, voice conversion evaluation, voice
conversion challenges.
I. INTRODUCTION
Voice conversion (VC) is a significant aspect of artifi-
cial intelligence. It is the study of how to convert one's
voice to sound like that of another without changing the
linguistic content. Voice conversion belongs to a general
technical field of speech synthesis, which converts text to
speech or changes the properties of speech, for example,
voice identity, emotion, and accents. Stewart, a pioneer in
speech synthesis, commented in 1922 [1] that “the really difficult
problem involved in the artificial production of speech-
sounds is not the making of a device which shall produce
speech, but in the manipulation of the apparatus”. As voice
conversion is focused on the manipulation of voice identity
in speech, it represents one of the challenging research
problems in speech processing.
There has been a continuous effort in the quest for effec-
tive manipulation of speech properties since the debut of
computer-based speech synthesis in the 1950s. The rapid
development of digital signal processing in the 1970s greatly
Berrak Sisman is with the Information Systems Technology and Design
(ISTD) Pillar of Singapore University of Technology and Design (SUTD),
Singapore.
Junichi Yamagishi is with National Institute of Informatics, Japan and
University of Edinburgh, United Kingdom.
Simon King is with the University of Edinburgh, United Kingdom.
Haizhou Li is with the Department of Electrical and Computer Engi-
neering, National University of Singapore.
facilitated the control of the parameters for speech manip-
ulation. While the original motivation of voice conversion
could be simply novelty and curiosity, the technological
advancements from statistical modeling to deep learning
have made a major impact on many real-life applications
that benefit consumers, such as personalized speech
synthesis [2], [3], communication aids for the speech-
impaired [4], speaker de-identification [5], voice mimicry [6]
and disguise [7], and voice dubbing for movies.
In general, a speaker can be characterized by three factors:
1) linguistic factors that are reflected in sentence
structure, lexical choice, and idiolect; 2) supra-segmental
factors such as the prosodic characteristics of a speech
signal, and 3) segmental factors that are related to short
term features, such as spectrum and formants. When the
linguistic content is fixed, the supra-segmental and the seg-
mental factors are the relevant factors concerning speaker
individuality. An effective voice conversion technique is
expected to convert both the supra-segmental and the seg-
mental factors. Despite much progress, voice conversion
is still far from perfect. In this paper, we celebrate the
technological advances, at the same time we expose their
limitations. We will discuss the state-of-the-art technology
from historical and technological perspectives.
A typical voice conversion pipeline includes speech
analysis, mapping, and reconstruction modules, as illus-
trated in Figure 1, which is referred to as the analysis-mapping-
reconstruction pipeline. The speech analyzer decomposes
the speech signals of a source speaker into features that
represent supra-segmental and segmental information, and
the mapping module changes them towards the target
speaker, and finally the reconstruction module re-synthesizes
time-domain speech signals. The mapping module has
taken centre stage in many of the studies. These tech-
niques can be categorized in different ways, for example,
based on the use of training data - parallel vs non-parallel,
the type of statistical modeling technique - parametric vs
non-parametric, the scope of optimization - frame level vs
utterance level, and the workflow of conversion - direct
mapping vs inter-lingual. Let’s first give an account from
the perspective of the use of training data.
The early studies of voice conversion were focused
on spectrum mapping using parallel training data, where
speech of the same linguistic content is available from
both the source and target speaker, for example, vector
quantization (VQ) [8] and fuzzy vector quantization [9].
With parallel data, one can align the two utterances using
Dynamic Time Warping [10]. The statistical parametric ap-
proaches can benefit from more training data for improved
performance; examples include the Gaussian mixture model
[11]–[13], partial least squares regression [14] and dynamic
kernel partial least squares regression (DKPLS) [15].
One of the successful statistical non-parametric tech-
niques is based on non-negative matrix factorization (NMF)
[16] and it is known as the exemplar-based sparse repre-
sentation technique [17]–[20]. It requires a smaller amount
of training data than the parametric techniques, and ad-
dresses well the over-smoothing problem. The family of
sparse representation techniques includes phonetic sparse
representation and group sparsity implementations [21], [22],
which greatly improved the voice quality on small parallel
training datasets.
The studies on voice conversion towards non-parallel
training data [23]–[28] open up the opportunities for new
applications. The challenge is how to establish the mapping
between non-parallel source and target utterances. The
INCA alignment technique by Erro et al. [27] represents
one of the solutions to the non-parallel data alignment
problem [29]. With the alignment techniques, one is able
to extend the voice conversion techniques from parallel
data to non-parallel data, such as the extension to DKPLS
[30] and the speaker model alignment method [31]. The Phonetic
PosteriorGram (PPG)-based approach [32] represents an-
other direction of research towards non-parallel training
data. While the alignment technique doesn't use external
resources, the PPG-based approach makes use of an auto-
matic speech recognizer to generate an intermediate phonetic
representation [33], [34] as the interlingua between the
speakers. Successful applications include Phonetic Sparse
Representation [22].
Wu and Li [6], and Mohammadi and Kain [35] provided
an overview of voice conversion systems from the per-
spective of time alignment of speech features followed by
feature mapping, which represents the statistical modeling
school of thought. The advent of deep learning techniques
represents an important technology milestone in the voice
conversion research [36]. It has not only greatly advanced
the state-of-the-art, but also transformed the way we for-
mulate the voice conversion research problems. It also
opens up a new direction of research beyond the parallel
and non-parallel data paradigm. Nonetheless, the studies
on statistical modeling approaches have provided profound
insights into many aspects of the research problems that
serve as the foundation work of today’s deep learning
methodology. In this paper, we will give an overview of voice
conversion research by providing a perspective that reveals
the underlying design principles from statistical modeling
to deep learning.
Deep learning’s contributions to voice conversion can be
summarized in three areas. Firstly, it allows the mapping
module to learn from a large amount of speech data,
therefore tremendously improving voice quality and simi-
larity to the target speaker. With neural networks, we see the
mapping module as a nonlinear transformation function
[37], that is trained from data [38], [39]. LSTM represents a
successful implementation with parallel training data [40].
Deep learning has made a great impact on non-parallel data
techniques. The joint use of DBLSTM and i-vectors [41], the
KL-divergence and DNN-based approach [42], variational auto-
encoders [43], average modeling [44] and DBLSTM-based
recurrent neural networks [32], [45] bring the voice quality
to a new height. More recently, Generative Adversarial
Networks such as VAW-GAN [46], CycleGAN [47]–[49], and
StarGAN [50] further advance the state-of-the-art.
Secondly, deep learning has created a profound impact
on vocoding technology. Speech analysis and reconstruc-
tion modules are typically implemented using a traditional
parametric vocoder [11]–[13], [51]. The parameters of such
vocoders are manually tuned according to some over-
simplified assumptions in signal processing. As a result,
the parametric vocoders offer a suboptimal solution. A neural
vocoder is a neural network that learns to reconstruct an
audio waveform from acoustic features [52]. For the first
time, the vocoder becomes trainable and data-driven.
WaveNet vocoder [53] represents one of the popular neural
vocoders, that directly estimates waveform samples from
the input feature vectors. It has been studied intensively,
for example, speaker dependent and independent WaveNet
vocoder [54], [55], quasi-periodic WaveNet vocoder [56],
[57], adaptive WaveNet vocoder with GANs [58], factorized
WaveNet vocoder [59], and refined WaveNet vocoder with
VAEs [60] that are known for their natural sounding voice
quality. The WaveNet vocoder is also widely adopted in tradi-
tional voice conversion pipelines, such as GMM [54] and sparse
representation [61], [62] systems. Other successful neural
vocoders include the WaveRNN vocoder [63] and WaveGlow [64],
which are excellent vocoders in their own right.
Thirdly, deep learning represents a departure from the
traditional analysis-mapping-reconstruction pipeline. All
the above techniques largely follow the voice conversion
pipeline as in Figure 1. As the neural vocoder is trainable, it
can be trained jointly with the mapping module [58] and even
with the analysis module to become an end-to-end solution [53].
Voice conversion research used to be a niche area in
speech synthesis. However, it has become a major topic
in recent years. In the 45th International Conference on
Acoustics, Speech, and Signal Processing (ICASSP 2020),
voice conversion papers represent more than one-third of
the papers under the speech synthesis category. The growth
of the research community was accelerated by collaborative
activities across academia and industry, such as the Voice
Conversion Challenge (VCC) 2016, which was first launched
[65]–[67] at INTERSPEECH 2016. VCC 2016 is focused on the
most basic voice conversion task, that is, voice conversion
with parallel training data recorded in an acoustic studio. It
establishes the evaluation methodology and protocol for
performance benchmarking, which are widely adopted in the
community. VCC 2018 [68]–[70] proposes a non-parallel
training data challenge, and also connects voice conversion
with anti-spoofing studies in speaker verification. VCC 2020
puts forward a cross-lingual voice conversion challenge for
the first time. We will provide an overview of the series
of challenges and the publicly available resources in this
paper.

Fig. 1: The typical flow of a voice conversion system. The pink box represents the training of the mapping function, while
the blue box applies the mapping function at run-time, in a 3-step pipeline process Y = (R \circ F \circ A)(X).
This paper is organized as follows: In Section II, we
present the typical flow of voice conversion that includes
feature extraction, feature mapping and waveform gener-
ation. In Section III, we study the statistical modeling for
voice conversion with parallel training data. In Section IV,
we study statistical modeling for voice conversion without
parallel training data. In Section V, we study the deep learn-
ing approaches for voice conversion with parallel training
data, and beyond parallel training data. In Section VI, we
explain the evaluation techniques for voice conversion. In
Section VII and VIII, we summarize the series of voice
conversion challenges, and publicly available research re-
sources for voice conversion. We conclude in Section IX.
II. TYPICAL FLOW OF VOICE CONVERSION
The goal of voice conversion is to modify a source
speaker’s voice to sound as if it is produced by a target
speaker. In other words, a voice conversion system only
modifies the speaker-dependent characteristics of speech,
such as formants, fundamental frequency (F0), intonation,
intensity and duration, while carrying over the speaker-
independent speech content.
The core module of a voice conversion system performs
the conversion function. Let's denote the source and target
speech signals as X and Y, respectively. As will be discussed
later, voice conversion is typically applied to some inter-
mediate representation of speech, or speech feature, that
characterizes a speech frame. Let's denote the source and
target speech features as x and y. The conversion function
can be formulated as follows,

y = F(x)    (1)

where F(·) is also called the mapping function in the rest of this
paper. As illustrated in Figure 1, a typical voice conversion
framework is implemented in three steps: 1) speech analy-
sis, 2) feature mapping, and 3) speech reconstruction, which
we call the analysis-mapping-reconstruction pipeline. We
discuss each step in detail next.
A. Speech Analysis and Reconstruction
The speech analysis and reconstruction are two cru-
cial processes in the 3-step pipeline. The goal of speech
analysis is to decompose speech signals into some form
of intermediate representation for effective manipulation
or modification with respect to the acoustic properties of
speech. There have been many useful intermediate repre-
sentation techniques that were initially studied for speech
communication and speech synthesis. They come in handy
for voice conversion. In general, the techniques can be
categorized into model-based representations, and signal-
based representations.
In model-based representation, we assume that the speech
signal is generated according to an underlying physical
model, such as the source-filter model, and express a frame of
speech signal as a set of model parameters. By modifying
the parameters, we manipulate the input speech. In signal-
based representation, we don’t assume any models, but
rather represent speech as a composition of controllable
elements in the time domain or frequency domain. Let's denote
the intermediate representation for the source speaker as x;
speech analysis can then be described by a function,

x = A(X)    (2)
Speech reconstruction can be seen as an inverse function
of the speech analysis, which operates on the modified
parameters and generates an audible speech signal. It works
with speech analysis in tandem. For example, a vocoder [51]
is used to express a speech frame with a set of controllable
parameters that can be converted back into a speech
waveform. The Griffin-Lim algorithm is used to reconstruct a
speech signal from a modified short-time Fourier transform
after amplitude modification [71]. As the output speech
quality is affected by the speech reconstruction process,
speech reconstruction is also one of the important topics
in voice conversion research. Let's denote the modified
intermediate representation and the reconstructed speech
signal for the target speaker as y and Y = R(y); voice conversion
can then be described by a composition of three functions,

Y = (R \circ F \circ A)(X) = C(X)    (3)
which represents the typical flow of a voice conversion system
as a 3-step pipeline. As the mapping is applied frame-by-
frame, the number of converted speech features y is the
same as that of the source speech features x if the speech
duration is not modified in the process.
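The composition in Eq. (3) can be sketched directly as function composition. The minimal Python sketch below is illustrative only; the names analyze, map_features and reconstruct are hypothetical placeholders for the analysis, mapping and reconstruction modules described above.

def convert(X, analyze, map_features, reconstruct):
    """Analysis-mapping-reconstruction pipeline: Y = (R o F o A)(X)."""
    x = analyze(X)            # A: waveform -> frame-level intermediate representation
    y = map_features(x)       # F: source features -> target features (frame-by-frame)
    Y = reconstruct(y)        # R: modified features -> audible waveform
    return Y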
While speech analysis and reconstruction make voice
conversion possible, just like other signal processing
techniques, they inevitably also introduce artifacts. Many
studies have been devoted to minimizing such artifacts. We next
discuss the most commonly used speech analysis and
reconstruction techniques in voice conversion.
1) Signal-based Representation: Pitch Synchronous Over-
Lap and Add (PSOLA) is an example of signal-based rep-
resentation techniques. It decomposes a speech signal into
overlapping speech segments [72], each of which represents
one of the successive pitch periods of the speech signal. By
overlap-and-adding these speech segments with different
pitch periods, we can reconstruct the speech signal with a dif-
ferent intonation. As PSOLA operates directly on the time-
domain speech signal [72], the analysis and reconstruction
do not introduce significant artifacts. While PSOLA tech-
nique is effective for modification of fundamental frequency
of speech signals, it suffers from several inherent limitations
[73], [74]. For example, an unvoiced speech signal is not
periodic, and the manipulation of the time-domain signal is not
straightforward.
Harmonic plus Noise Model (HNM) represents another
signal-based representation approach. It works under the
assumption that a speech signal can be represented as
a harmonic component plus a noise component that is
delimited by the so-called maximum voiced frequency
[75]. The harmonic component is modeled as the sum of
harmonic sinusoids up to the maximum voiced frequency,
while the noise component is modeled as Gaussian noise
filtered by a time-varying autoregressive filter. As HNM
decomposition is represented by some controllable param-
eters, it allows for easy modification of speech [76], [77].
2) Model-based Representation: The model-based tech-
nique assumes that the input signal can be mathematically
represented by a model whose parameters vary with time.
A typical example is the source-filter model that represents
a speech signal as the outcome of an excitation of the
larynx (source) modulated by a transfer (filter) function
determined by the shape of the supralaryngeal vocal tract. A
vocoder, a short form of voice coder, was initially developed
to minimize the amount of data that are transmitted for
voice communication. It encodes speech into slowly chang-
ing control parameters, such as linear predictive coding
and mel-log spectrum approximation [78], that describe the
filter, and re-synthesizes the speech signal with the source
information at the receiving end. In voice conversion, we
convert the speech signals from a source speaker to mimic
the target speaker by modifying the controllable parame-
ters.
The majority of vocoders are designed based on some
form of the source-filter model of speech production, such
as mixed excitation with a spectral envelope, and glottal
vocoders [79]. STRAIGHT, or “Speech Transformation and
Representation using Adaptive Interpolation of weiGHTed
spectrum”, is one of the popular vocoders in speech synthe-
sis and voice conversion [80]. It decomposes a speech signal
into: 1) a smooth spectrogram which is free from periodicity
in time and frequency; 2) a fundamental frequency (F0)
contour which is estimated using a fixed-point algorithm;
and 3) a time-frequency periodicity map which captures
the spectral shape of the noise and its temporal envelope.
STRAIGHT is widely used in voice conversion because its
parametric representation facilitates the statistical modeling
of speech, which allows for easy manipulation of speech [11],
[81], [82].
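As a concrete illustration of such a decomposition, the sketch below uses WORLD (via the pyworld package), a freely available vocoder closely related in spirit to STRAIGHT, to extract F0, a smooth spectral envelope and an aperiodicity map, and to resynthesize speech after a simple manipulation. The file names and the F0 scaling factor are arbitrary assumptions for illustration.

import numpy as np
import soundfile as sf
import pyworld as pw

# Analysis: decompose a mono speech signal into the three STRAIGHT-like streams.
x, fs = sf.read("source.wav")                    # assumes a mono recording
x = np.ascontiguousarray(x, dtype=np.float64)

f0, t = pw.dio(x, fs)                            # raw F0 contour
f0 = pw.stonemask(x, f0, t, fs)                  # F0 refinement
sp = pw.cheaptrick(x, f0, t, fs)                 # smooth spectrogram (spectral envelope)
ap = pw.d4c(x, f0, t, fs)                        # time-frequency aperiodicity map

# Reconstruction after a toy manipulation: shift F0 up by 20%.
y = pw.synthesize(f0 * 1.2, sp, ap, fs)
sf.write("modified.wav", y, fs)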
Parametric vocoders are widely adopted for analysis and
reconstruction of speech in voice conversion studies [8],
[9], [11], [12], [46], [47], [83], [84], and continue to play a
major role today [17], [21], [22]. The traditional parametric
vocoders are designed to approximate the complex me-
chanics of human speech production under certain sim-
plified assumptions. For example, the interaction between
F0 and the formant structure is ignored, and the original phase
structure is discarded [85]. The assumptions of a stationary
process in the short-time window and a time-invariant linear
filter also give rise to “robotic” and “buzzy” voices. Such
problems become more serious in voice conversion as we
modify both F0 and the formant structure of speech among
others at the same time. We believe that vocoding can
be improved by considering the interaction between the
parameters.
3) WaveNet Vocoder: Deep learning offers a solution to
some of the inherent problems of parametric vocoders.
WaveNet [53] is a deep neural network that learns to
generate high-quality time-domain waveforms. As it doesn't
assume any mathematical model, it is a data-driven solu-
tion that requires a large amount of training data.
The joint probability of a waveform X = {x_1, x_2, ..., x_N} can
be factorized as a product of conditional probabilities:

p(X) = \prod_{n=1}^{N} p(x_n | x_1, x_2, ..., x_{n-1})    (4)

A WaveNet is constructed with many residual blocks, each
of which consists of 2 × 1 dilated causal convolutions,
a gated activation function and 1 × 1 convolutions. With
additional auxiliary features h, WaveNet can also model
the conditional distribution p(x|h) [53]. Eq. (4) can then be
written as follows:

p(X|h) = \prod_{n=1}^{N} p(x_n | x_1, x_2, ..., x_{n-1}, h)    (5)
A typical parametric vocoder performs both analysis and
reconstruction of speech. However, most of today’s WaveNet
vocoders only cover the function of speech reconstruction.
They take some intermediate representations of speech as
the input auxiliary features, and generate the speech wave-
form as the output. The WaveNet vocoder [55] remarkably
outperforms the traditional parametric vocoders in terms
of sound quality. Not only can it learn the relationship
between input features and output waveform, but also it
learns the interaction among the input features. It has been
successfully adopted as part of the state-of-the-art speech
synthesis [3], [86]–[89] and voice conversion [54], [55], [57],
[60]–[62], [86], [90]–[97] systems.
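To make the residual-block structure above concrete, here is a minimal PyTorch sketch of one WaveNet-style block with a 2 × 1 dilated causal convolution, a gated activation and 1 × 1 convolutions, conditioned on auxiliary features h. The channel sizes are arbitrary assumptions, and the sketch follows the textual description above rather than any particular released implementation.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One WaveNet-style residual block: dilated causal conv + gated activation."""
    def __init__(self, channels, dilation, cond_channels):
        super().__init__()
        self.dilation = dilation
        # 2x1 dilated causal convolution (filter and gate computed jointly)
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size=2, dilation=dilation)
        # 1x1 convolutions for the conditioning features h and the residual/skip outputs
        self.cond = nn.Conv1d(cond_channels, 2 * channels, kernel_size=1)
        self.res = nn.Conv1d(channels, channels, kernel_size=1)
        self.skip = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x, h):
        # left-pad so the convolution is causal (no future samples are used)
        y = nn.functional.pad(x, (self.dilation, 0))
        y = self.conv(y) + self.cond(h)
        filt, gate = y.chunk(2, dim=1)
        z = torch.tanh(filt) * torch.sigmoid(gate)   # gated activation unit
        return x + self.res(z), self.skip(z)          # residual and skip paths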
There have been promising studies on using vocoding
parameters as the intermediate representations in WaveNet
vocoding. A speaker independent WaveNet vocoder [55] is
studied by utilizing the STRAIGHT vocoding parameters,
such as F0, aperiodicity, and spectrum as the inputs of
WaveNet. In this way, WaveNet learns a sample-by-sample
correspondence between the time-domain waveform and
the input vocoding parameters. When such a WaveNet
vocoder is trained on speech signals from a large speaker
population, we obtain a speaker independent vocoder [55].
By adapting the speaker independent WaveNet vocoder
with speaker specific data, we obtain a speaker dependent
vocoder that generates personalized voice output [58], [60].
The study on WaveNet vocoder also opens up opportu-
nities for the use of other non-vocoding parameters as
the input. For example, a recent study adopts phonetic
posteriorgrams (PPGs) in WaveNet vocoding with promising
results in voice conversion with non-parallel training data
[94]–[97]. Another study adopts latent code of autoencoder
and speaker embedding as the speech representation for
WaveNet vocoder [98].
4) Recent Progress on Neural Vocoders: More recently,
speaker independent WaveRNN-based neural vocoder [63]
became popular as it can generate human-like voices from
both in-domain and out-of-domain spectrograms [99]–[101].
Another well-known neural vocoder that achieves high-
quality synthesis performance is WaveGlow [64]. WaveGlow
is a flow-based network capable of generating high quality
speech from mel-spectrogram [102]. WaveGlow benefits
from the best of Glow and WaveNet so as to provide fast,
efficient and high-quality audio synthesis, without the need
for auto-regression. We note that WaveGlow is implemented
using only a single network with a single cost function, that
is to maximize the likelihood of the training data, which
makes the training procedure simple and stable [103].
WaveNet [53] uses an auto-regressive (AR) approach to
model the distribution of waveform sampling points, that
incurs a high computational cost. As an alternative to auto-
regression, a neural source-filter (NSF) waveform modeling
framework is proposed [104], [105]. We note that NSF is
straightforward to train and fast in waveform generation. It
is reported to be 100 times faster than the WaveNet vocoder,
while achieving comparable voice quality on a large speech
corpus [106].
B. Feature Extraction
With speech analysis, we derive vocoding parameters
that usually contain spectral and prosodic components
to represent the input speech. The vocoding parameters
characterize the speech in a way that we can reconstruct the
speech signal later on after transmission. This is particularly
important in speech communication. However, such vocod-
ing parameters may not be the best for transformation of
voice identity. More often, the vocoding parameters are fur-
ther transformed into speech features, that we call feature
extraction in Figure 1, for more effective modification of the
acoustic properties in voice conversion.
For the spectral component, feature extraction aims
to derive low-dimensional representations from the high-
dimensional raw spectra. Generally speaking, the spectral
features should be able to represent the speaker individuality
well. The features should not only fit the spectral envelope well,
but also be able to be converted back to the spectral envelope.
They should have good interpolation properties that allow
for flexible modification.
The magnitude spectrum can be warped to the Mel or Bark
frequency scale, which is perceptually meaningful for voice
conversion. It can also be transformed into the cepstral domain
with a finite number of coefficients via the Discrete
Cosine Transform of the log-magnitude spectrum. Cepstral coefficients
are less correlated. In this way, the high-dimensional magnitude
spectrum is transformed into a lower-dimensional feature rep-
resentation. The commonly used speech features include
Mel-cepstral coefficients (MCC), linear predictive cepstral
coefficients (LPCC), and line spectral frequencies (LSF).
Typically, a speech frame is represented by a feature vector.
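A minimal sketch of the cepstral transform described above, using a DCT of the log-magnitude spectrum of one frame; it omits the Mel/Bark warping and is not a full MCC extractor, and the number of coefficients is an arbitrary assumption.

import numpy as np
from scipy.fftpack import dct

def cepstral_features(magnitude_frame, n_coeffs=25):
    """Low-dimensional cepstral coefficients from one magnitude-spectrum frame."""
    log_mag = np.log(np.maximum(magnitude_frame, 1e-10))   # avoid log(0)
    c = dct(log_mag, type=2, norm='ortho')                  # decorrelating transform
    return c[:n_coeffs]                                      # keep a finite number of coefficients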
Short-time analysis has been the most practical way
of speech analysis. Unfortunately, it inherently ignores the
temporal context of speech, which is crucial in voice conver-
sion. Many studies have shown that multiple frames [18],
[107], dynamic features [62], and phonetic segments serve
as effective features in feature mapping.
For the prosodic component, feature extraction can be
used to decompose prosodic signal, such as fundamental
frequency (F0), aperiodicity (AP), and energy contours,
into speaker dependent and independent parameters [82].
In this way, we can carry over the speaker independent
prosodic patterns, while converting speaker dependent
ones during the feature mapping.
C. Feature Mapping
In the typical flow of voice conversion, feature mapping
performs the modification of speech features from source
to target speaker. Spectral mapping seeks to change the
voice timbre, while prosody conversion seeks to modify the
prosody features, such as fundamental frequency, intona-
tion and duration. So far, spectral mapping remains the
center of many voice conversion studies.
During training, we learn the mapping function, F(·)
in Eq.(1), from training data. At run time inference, the
mapping function transforms the acoustic features. A large
part of this paper is devoted to the study of the mapping
function. In Section III, we will discuss the traditional
statistical modeling techniques with parallel training data.
In Section IV, we will review the statistical modeling tech-
niques that do not require parallel training data. In Section
V, we will introduce a number of deep learning approaches,
which includes 1) parallel training data of paired speakers;
and 2) beyond parallel data of paired speakers.
III. STATISTICAL MODELING FOR VOICE CONVERSION WITH
PARALLEL TRAINING DATA
Most of the traditional voice conversion techniques as-
sume availability of parallel training data. In other words,
the mapping function is trained on paired utterances of
the same linguistic content spoken by source and target
speaker. Voice conversion studies started with statistical
approaches [108] in the late 1980s, which can be grouped into
parametric and non-parametric mapping techniques. Para-
metric techniques make assumptions about the under-
lying statistical distributions of speech features and their
mapping. Non-parametric ones make fewer assumptions
about the data, but seek to fit the training data with the
best mapping function, while maintaining some ability to
generalize to unseen data.
Parametric techniques, such as the Gaussian mixture model
(GMM) [109], dynamic kernel partial least squares regres-
sion, and the PSOLA mapping technique [73], represent a great
success of the recent past. The vector quantization ap-
proach to voice conversion is a typical non-parametric
technique. It maps codewords between source and target
codebooks [8]. In this method, a source feature vector
is approximated by the nearest codeword in the source
codebook, and mapped to the corresponding codeword
in the target codebook. To reduce the quantization error,
fuzzy vector quantization was studied [9], [110], where
continuous weights for individual clusters are determined
at each frame according to the source feature vector. The
converted feature vector is defined as a weighted sum of
the centroid vectors of the mapping codebook. Recently,
the non-negative matrix factorization approach marks a successful
non-parametric implementation.
We will discuss a typical frame-level mapping paradigm
under the assumption of parallel training data, as illustrated
in Figure 2. During the training phase, given parallel train-
ing data from a source speaker x and a target speaker y,
frame alignment is performed to align the source speech
vectors and target speech vectors to obtain the paired
speech feature vectors z = {x, y}. Dynamic time warping
is a feature-based alignment technique that is commonly
used. A speech recognizer, which is equipped with phonetic
knowledge, can also be used to perform model-based align-
ment. Frame alignment has been well studied in speech
processing. In voice conversion, a large body of literature
has been devoted to the design of frame-level mapping
function.
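A minimal dynamic time warping sketch for the frame alignment step, written from the standard DTW recurrence; X and Y are assumed to be frame-level feature matrices of shape (frames, dimension), and the returned index pairs define the paired vectors z = {x, y}.

import numpy as np

def dtw_align(X, Y):
    """Align two feature sequences and return a list of (source, target) frame pairs."""
    Tx, Ty = len(X), len(Y)
    cost = np.full((Tx + 1, Ty + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Tx + 1):
        for j in range(1, Ty + 1):
            d = np.linalg.norm(X[i - 1] - Y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack the optimal warping path
    path, i, j = [], Tx, Ty
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]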
A. Gaussian Mixture Models
In Gaussian mixture modeling (GMM) approach to voice
conversion [109], we represent the relationship between
two sets of spectral envelopes, from source and target
speakers, using a Gaussian mixture model. The Gaussian
mixture model is a continuous parametric function, that
is trained to model the spectral mapping. In [109], har-
monic plus noise (HNM) features are used in the feature
mapping, which allows for high-quality modifications of
speech signals. The GMM approach is seen as an extension
to the vector quantization approach [8], [9], which results
in improved voice quality. However, the speech quality
is affected by some factors, e.g., spectral movement with
inappropriate dynamic characteristics caused by the frame-
by-frame conversion process, and excessive smoothing of
converted spectra [111]–[113].
To address the frame-by-frame conversion issue, a maxi-
mum likelihood estimation technique was studied to model
the spectral parameter trajectory [11]. This technique aims
to estimate an appropriate spectrum sequence using dy-
namic acoustic features. To address the over-smoothing
issue, or the muffled effect, the joint density Gaussian mixture
model (JD-GMM) was studied [2], [11] to jointly model the
sequences of spectral features and their variances using
maximum likelihood estimation, which increases the global
variance of the spectral features. The JD-GMM method in-
volves two phases: off-line training and run-time conversion
phases. During the training phase, Gaussian mixture model
(GMM) is adopted to model the joint probability density
p(z) of the paired feature vector sequence z={x,y}, which
represents the joint distribution of source speech xand
target speech y:
p(z) = \sum_{k=1}^{K} w_k^{(z)} \, \mathcal{N}(z \,|\, \mu_k^{(z)}, \Sigma_k^{(z)})    (6)

\mu_k^{(z)} = \begin{bmatrix} \mu_k^{(x)} \\ \mu_k^{(y)} \end{bmatrix}, \quad
\Sigma_k^{(z)} = \begin{bmatrix} \Sigma_k^{(xx)} & \Sigma_k^{(xy)} \\ \Sigma_k^{(yx)} & \Sigma_k^{(yy)} \end{bmatrix}

where K is the number of Gaussian components, and \mu_k^{(z)} and
\Sigma_k^{(z)} are the mean vector and the covariance matrix of the
k-th Gaussian component \mathcal{N}(z | \mu_k^{(z)}, \Sigma_k^{(z)}), respectively. To es-
timate the model parameters of the JD-GMM, the expectation-
maximization (EM) algorithm [114]–[117] is used to maxi-
mize the likelihood on the training data.
A post-filter based on modulation spectrum modification
is found useful to address the inherent over-smoothing
issue in statistical modeling [118], such as the GMM approach,
as it effectively compensates for the global variance. The
GMM approach is a parametric solution [119]–[123]. It
represents a successful statistical modeling technique that
works well with parallel training data.
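The sketch below illustrates JD-GMM conversion with the frame-wise minimum mean square error rule: a GMM is fitted on joint vectors z = [x; y] with scikit-learn, and each source frame is converted as a responsibility-weighted sum of component-wise linear regressions. X_src and Y_tgt are assumed to be frame-aligned feature matrices of shape (frames, D); the feature dimension and number of components are arbitrary assumptions, and the trajectory-based ML estimation with dynamic features [11] is omitted.

import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.stats import multivariate_normal

D = 24                                   # feature dimension (e.g. MCC order), assumed
Z = np.hstack([X_src, Y_tgt])            # aligned joint vectors z = [x; y], shape (T, 2D)
gmm = GaussianMixture(n_components=8, covariance_type='full').fit(Z)

def jdgmm_convert(x):
    """Frame-wise MMSE conversion of one source feature vector x of shape (D,)."""
    # responsibilities P(k|x) from the marginal source model
    resp = np.array([w * multivariate_normal.pdf(x, m[:D], C[:D, :D])
                     for w, m, C in zip(gmm.weights_, gmm.means_, gmm.covariances_)])
    resp /= resp.sum()
    y_hat = np.zeros(D)
    for k, (m, C) in enumerate(zip(gmm.means_, gmm.covariances_)):
        mu_x, mu_y = m[:D], m[D:]
        Sxx, Syx = C[:D, :D], C[D:, :D]
        y_hat += resp[k] * (mu_y + Syx @ np.linalg.solve(Sxx, x - mu_x))
    return y_hat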
B. Dynamic Kernel Partial Least Squares
The family of parametric techniques also include linear
[73], [74] or non-linear mapping functions. With the local
mapping functions, each frame of speech is typically trans-
formed independently from the neighboring frames, which
causes temporal discontinuities to the output [74].
To take into account the time-dependency between
speech features, a dynamic kernel partial least squares
(DKPLS) technique was studied [15]. This method is based
on a kernel transformation of the source features to allow
non-linear modeling, and concatenation of adjacent frames to
model the dynamics. The non-linear transformation takes
advantage of the global properties of the data that the GMM
approach doesn't. It was reported that DKPLS outperforms
the GMM approach [109] in terms of voice quality. This method
is simple and efficient, and does not require massive tuning.
More recently, DKPLS-based approaches are studied to
overcome the over-fitting and over-smoothing problems by
feature combination strategy [124].
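A rough sketch loosely following the DKPLS recipe: the source frames are kernelized against a set of reference centres, adjacent kernelized frames are concatenated to model the dynamics, and partial least squares regression maps the result to the target features. X_src, Y_tgt and X_new are assumed to be feature matrices (the first two frame-aligned); the kernel width, centre selection and number of components are arbitrary assumptions.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cross_decomposition import PLSRegression

def stack_context(F, width=1):
    """Concatenate +/- width neighbouring rows (frames); edge frames simply wrap around."""
    return np.hstack([np.roll(F, s, axis=0) for s in range(-width, width + 1)])

centres = X_src[::50]                                   # reference vectors (crude subsampling)
gamma = 0.05                                            # Gaussian kernel width (assumed)

K_train = stack_context(rbf_kernel(X_src, centres, gamma=gamma))
pls = PLSRegression(n_components=32).fit(K_train, Y_tgt)

# Run-time conversion of new source frames
Y_hat = pls.predict(stack_context(rbf_kernel(X_new, centres, gamma=gamma)))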
Fig. 2: Training and run-time inference of voice conversion with parallel training data under the frame-level mapping
paradigm. The pink boxes represent the training algorithms of the models that result in the mapping function F(x) in
the blue box for run-time inference. Dotted box (1) includes examples of statistical approaches, and (2) includes examples
of deep learning approaches.

While statistical modeling for the mapping of spectral
features has been well studied, conversion of prosody is
often achieved by simply shifting and scaling F0, which is
not sufficient for high-quality voice conversion. Hierarchical
modeling of prosody, for different linguistic units at several
distinct temporal scales, represents an advanced technique
for prosody conversion [82], [125]–[127]. DKPLS has cre-
ated a platform for multi-scale prosody conversion through
wavelet transform [128] that shows significant improvement
in naturalness over the F0 shifting and scaling technique.
C. Frequency Warping
Parametric techniques, such as GMM [109] and DKPLS
[15], usually suffer from over-smoothing because they use
the minimum mean square error [81] or the maximum
likelihood [11] function as the optimization criterion. As a
result, the system produces acoustic features that represent
statistical average, and fails to capture the desired details
of temporal and spectral dynamics.
Additionally, parametric techniques generally employ
low-dimensional features, as discussed in Section II.B, such
as the Mel cepstral coefficients (MCC) or line spectral
frequencies (LSF) to avoid the curse of dimensionality. The
low dimensional features, however, are doomed to lose
spectral details because they have low-resolution. Statistical
averaging and low-resolution features both lead to the
muffled effect of output speech [129].
To preserve the necessary spectral details during con-
version, a number of frequency warping-based methods
were introduced. The frequency warping technique directly
transforms the high resolution source spectrum to that of
the target speaker through a frequency warping function. In
recent literature, the warping function is either realized by
a single parameter, such as VTLN-based approaches [26],
[130]–[133], or represented as a piecewise linear function
[73], [129], [134], which has become a mainstream solution.
The goal of piecewise linear warping function is to align a
set of frequencies between the source and target spectrum
by minimizing the spectral distance or maximizing the
correlation between the converted and target spectrum.
More recently, the parametric frequency warping technique
was combined with a non-parametric exemplar-based
technique, which achieves good performance [107].
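A minimal sketch of a piecewise linear frequency warping applied to one high-resolution magnitude-spectrum frame. The anchor frequencies (e.g. matched formant positions) are assumed to be given and to include the end points 0 and the Nyquist frequency.

import numpy as np

def piecewise_warp(spectrum, freqs, src_anchors, tgt_anchors):
    """Warp one magnitude-spectrum frame with a piecewise linear warping function
    defined by matched (source, target) anchor frequencies."""
    # for every target-frequency bin, find the corresponding source frequency
    warp = np.interp(freqs, tgt_anchors, src_anchors)
    # sample the source spectrum at the warped frequencies
    return np.interp(warp, freqs, spectrum)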
D. Non-negative Matrix Factorization
Non-negative matrix factorization (NMF) [135] is an ef-
fective data mining technique that has been widely used,
especially for reconstruction of high quality signals, such
as in speech enhancement [136], [137], speech de-noising
[138], [139], noise and speech estimation [140]. It factorizes
a matrix into two matrices, a dictionary and an activation
matrix, with the property that all three matrices have no
negative elements. The NMF-based techniques are shown
effective in voice conversion with very limited training data.
It marks a major progress of non-parametric approach
to voice conversion since vector quantization technique
was introduced. Successful implementations include non-
negative spectrogram deconvolution [141], locally linear
embedding (LLE) [142], and unit selection [20]. In NMF-
based approaches, a target spectrogram is constructed
as a linear combination of exemplars. Therefore, the over-
smoothing problem can also arise. To overcome the over-
smoothing problem, several effective techniques were de-
veloped, which we summarize next.
1) Sparse Representation: One effective way to alleviate
the over-smoothing problem is to apply sparsity constraint
to the activation matrix, referred to as exemplar-based
sparse representation.
As illustrated in Figure 3, a pair of dictionaries A and B
are first constructed from speech feature vectors, that we
call aligned exemplars, from the source and target. [A; B] is also
called the coupled dictionary.

Fig. 3: Illustration of non-negative matrix factorization for
exemplar-based sparse representation.

At run-time, let's consider a speech utterance as a sequence
of speech feature vectors that form a spectrogram matrix.
The matrix of a source utterance X can be represented as

X \approx A \hat{H}    (7)

Due to the non-negative nature of the spectrogram, the NMF tech-
nique is employed to estimate the source activation matrix
\hat{H}, which is constrained to be sparse. Mathematically, we
estimate \hat{H} by minimizing an objective function,

\hat{H} = \arg\min_{H \ge 0} d(X, AH) + \lambda ||H||    (8)

where \lambda is the sparsity penalty factor. To estimate the activation
matrix \hat{H}, a generalised Kullback-Leibler (KL) divergence is
used. It is assumed that the source and target dictionaries A
and B can share the same source activation matrix \hat{H}.
Therefore, the converted spectrogram for the target
speaker can be written as

\hat{Y} = B \hat{H}    (9)

where the activation matrix \hat{H} serves as the pivot to transfer
the source utterance X to the target utterance Y.
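The activation estimate in Eq. (8) can be sketched with the standard multiplicative update rule for the generalised KL divergence with an additive sparsity penalty, followed by the copy step of Eq. (9). This is an illustrative implementation written from the equations above; the number of iterations and the penalty weight are arbitrary assumptions.

import numpy as np

def estimate_activations(X, A, n_iter=100, lam=0.1, eps=1e-12):
    """Estimate a sparse activation matrix H such that X ~= A H, using multiplicative
    KL-divergence updates with an additive sparsity penalty lam (Eq. (8))."""
    H = np.random.rand(A.shape[1], X.shape[1])
    for _ in range(n_iter):
        AH = A @ H + eps
        H *= (A.T @ (X / AH)) / (A.T @ np.ones_like(X) + lam + eps)
    return H

# Conversion: copy the activations to the target dictionary B (Eq. (9))
# Y_hat = B @ estimate_activations(X, A)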
The sparse representation framework continues to attract
much attention in voice conversion. The recent studies
include its extension to discriminative graph-embedded
NMF approach [19], phonetic sparse representation for
spectrum conversion [22], and its application to timbre and
prosody conversion [143], [144].
2) Phonetic Sparse Representation: As the frame-level
mapping is done at the acoustic feature level, the coupled
dictionary [A; B] is therefore called the acoustic dictionary.
With the transcripts of the training data and a general-purpose
speech recognizer, we are able to obtain phonetic labels
and their boundaries. Studies have shown that the strat-
egy of dictionary construction plays an important role in
voice conversion [145]. The idea of selecting a sub-dictionary
according to the run-time speech content shows improved
performance [21].
Phonetic sparse representation [22] is an extension to
sparse representation for voice conversion. It is built on
the idea of phonetic sub-dictionaries, and dictionary selec-
tion at run-time. The study shows that multiple phonetic
sub-dictionaries consistently outperform a single dictionary
in exemplar-based sparse representation voice conversion
[21], [22]. However, the phonetic sparse representation relies
on a speech recognizer at run-time to help select the sub-
dictionary.
3) Group Sparse Representation: Sisman et al. [62] pro-
posed group sparse representation to formulate both
exemplar-based sparse representation [141], and phonetic
sparse representation [22] under a unified mathematical
framework. With the group sparsity regularization, only
the phonetic sub-dictionary that is relevant to the input
features is likely to be activated at run-time inference. Un-
like phonetic sparse representation that relies on a speech
recognizer for both training and run-time inference, group
sparse representation only requires the speech recognizer
during training when we build the phonetic dictionary. It
was reported that group sparse representation provides sim-
ilar performance to that of phonetic sparse representation
when performing both spectrum and prosody conversion
[62].
IV. STATISTICAL MODELING FOR VOICE CONVERSION WITH
NON-PARALLEL TRAINING DATA
It is easy to understand that it is more straightforward
to train a mapping function from parallel than non-parallel
training data. However, parallel training data are not always
available. In real-world applications, there are situations
where only non-parallel data are available. Intuitively, if we
can derive the equivalents of speech frames or segments
between speakers from non-parallel data, we are able to
establish or to refine the mapping function using the con-
ventional linear transformation parameter training, such as
GMM, DKPLS or frequency warping.
There were a number of attempts to do so. For example,
one idea is to find a source-target mapping between unsu-
pervised feature clusters [146]. Another is to use a speech
recognizer to index the target training data so that we can
retrieve similar frames from the target database for an unknown
source frame at run-time [147]. Unfortunately, each of the
steps may produce errors that accumulate and may lead to
a poor parameter estimation [146]. There was also a study
to use a hidden Markov model (HMM) that is trained for the
target speaker, then the parameters of GMM-based linear
transformation function are estimated in such a way that
the converted source vectors exhibit maximum likelihood
with respect to the target HMM [148]. This method shows
performance comparable to that of parallel-data methods.
However, it requires that the orthography of the training
utterances be known, which limits its use.
Next, we will discuss three clusters of studies and their
representative work: 1) the INCA algorithm, 2) the unit selection
algorithm, and 3) the speaker modeling algorithm.
A. INCA Algorithm
INCA refers to an Iterative combination of a Nearest
Neighbor search step and a Conversion step Alignment
method [27]. It learns a mapping function by finding
the nearest neighbor of each source vector in the target
acoustic space.

Fig. 4: The training of a frame-level mapping function is an
iterative process between the nearest neighbor search step
(INCA alignment) and the conversion step (a parametric
mapping function).

It is based on a hypothesis that an iter-
ative refinement of the basic nearest neighbour method,
in tandem with the voice conversion system, would lead
to a progressive alignment improvement. The main idea is
that the intermediate voice, x_s^k, obtained after the previous
nearest neighbour alignment can be used as the source
voice during the next iteration:

x_s^{k+1} = F_k(x_s^k)    (10)

During training, the optimization process is repeated until
the current intermediate voice, x_s^k, is close enough to the
target voice, y_t. INCA represents a successful framework for
the non-parallel training data problem, where the nearest
neighbor search step (INCA alignment) and the conversion
step (a parametric mapping function) iterates to optimize
the mapping function, as illustrated in Figure 4.
INCA was first implemented with GMM approach [109]
for voice conversion to estimate a linear mapping func-
tion. As INCA does not require any phonetic or linguistic
information, it not only works for non-parallel training
data, but also works for cross-lingual voice conversion.
Experiments show that the INCA implementation of a cross-
lingual system achieves similar performance to its intra-
lingual counterpart that is trained on parallel data [27].
INCA was further implemented with DKPLS approach
[15] that was discussed in Section III.B for parallel training
data. The idea [30] is to use the INCA alignment algorithm
[27] to find the corresponding frames from the source and
target datasets, that allows the DKPLS regression to find a
non-linear mapping between the aligned datasets. It was re-
ported [30] that the INCA-DKPLS implementation produces
high-quality voice that is comparable to the implementation
with parallel training data, given the same amount of training
data.
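A compact sketch of the INCA loop: nearest-neighbour alignment of the current intermediate voice against the target frames, followed by re-estimation of the mapping on the aligned pairs. For brevity, a ridge regression stands in for the GMM or DKPLS mapping used in the literature, and the single-direction search is a simplification; X_src and Y_tgt are assumed to be unaligned source and target feature matrices.

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import Ridge

def inca(X_src, Y_tgt, n_iter=10):
    """Iterate nearest-neighbour alignment and mapping re-estimation (Eq. (10))."""
    x_k = X_src.copy()                       # intermediate voice, starts at the source
    for _ in range(n_iter):
        # 1) align: find the nearest target frame for every intermediate frame
        nn = NearestNeighbors(n_neighbors=1).fit(Y_tgt)
        idx = nn.kneighbors(x_k, return_distance=False)[:, 0]
        # 2) convert: re-estimate the mapping on the aligned pairs and update x_k
        F = Ridge(alpha=1.0).fit(X_src, Y_tgt[idx])
        x_k = F.predict(X_src)
    return F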
Fig. 5: Run-time inference of the unit selection algorithm, which
doesn't model a mapping function with parameters, but
rather searches for the output feature sequence directly from
the target speaker database, and optimizes the output at the
utterance level.
B. Unit Selection Algorithm
The unit selection algorithm has been widely used to generate
natural-sounding speech in speech synthesis. It is known
to produce high speaker similarity and voice quality [75],
[149], [150] because the synthesized waveform is formed
of sound units directly from the target speaker [151]. The
unit selection algorithm optimizes the unit selection from
a voice inventory of a target speaker. It was suggested
[152] to make use of a unit selection synthesis system to
generate parallel versions of the training sentences from
non-parallel data. With the resulting pseudo-parallel data,
the statistical modeling techniques for parallel training data,
which we discuss in Section III, can be readily applied. While
this approach produces satisfactory voice quality [152], it
requires a large speech database to develop the voice
inventory, which is not always practical in reality.
Another idea is to follow what we do in unit selection
speech synthesis by defining a speech feature vector as a
unit [24]. Given an utterance of M speech feature vectors
X = {x_1, x_2, ..., x_M} from the source speaker, dynamic pro-
gramming is applied to find the sequence of feature vectors
y_i from the target speaker that minimizes a cost function,

Y = \arg\min_{y} \Big( \alpha \sum_{i=1}^{M} d_1(x_i, y_i) + (1 - \alpha) \sum_{i=2}^{M} d_2(y_i, y_{i-1}) \Big)    (11)

where d_1(·) represents the acoustic distance between a
source and a target feature vector, while d_2(·) is the con-
catenative cost between two target feature vectors. With
the acoustic distance, we make sure that the retrieved
speech features from the target speakers are close to those
of the source; with the concatenative cost, we encourage
the consecutive speech frames from the target speaker
database to be retrieved together in a multi-frame segment.
As illustrated in Figure 5, the unit selection algorithm is a non-
parametric solution because we don't model the conver-
sion with parameters. It optimizes the output by applying
dynamic programming to find the best feature vector
sequence from the target speaker database. The mapping
function Y = F(X) is defined by the cost function in Eq. (11) itself,
and optimized at the utterance level.
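The sketch below implements the dynamic programming search of Eq. (11) directly over a target database of frame-level feature vectors. It materializes the full target and concatenation cost matrices, which is only practical for small databases; the weight alpha is an arbitrary assumption.

import numpy as np

def unit_selection(X, Y_db, alpha=0.5):
    """Find the target-unit sequence minimising the cost of Eq. (11) for source X (M, D)
    over a target database Y_db (N, D)."""
    M, N = len(X), len(Y_db)
    d1 = np.linalg.norm(X[:, None, :] - Y_db[None, :, :], axis=-1)      # target cost (M, N)
    d2 = np.linalg.norm(Y_db[:, None, :] - Y_db[None, :, :], axis=-1)   # concatenation cost (N, N)
    cost = alpha * d1[0]
    back = np.zeros((M, N), dtype=int)
    for i in range(1, M):
        total = cost[None, :] + (1 - alpha) * d2.T       # total[j, k]: end at k, move to j
        back[i] = np.argmin(total, axis=1)               # best predecessor for each unit j
        cost = alpha * d1[i] + total[np.arange(N), back[i]]
    # backtrack the optimal unit sequence
    path = [int(np.argmin(cost))]
    for i in range(M - 1, 0, -1):
        path.append(back[i][path[-1]])
    return Y_db[np.array(path[::-1])]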
C. Speaker Modeling Algorithm
The techniques for text-independent speaker character-
ization are readily available for non-parallel training data,
where a speaker can be modeled by a set of parameters,
such as a GMM or an i-vector. It is possible to make use of
such speaker models to perform voice conversion.
Mouchtaris et al. [153] used a GMM-based technique to
model the relationship between reference speakers in advance
and apply the relationship to a new speaker. Toda et
al. [154] proposed an eigenvoice approach that performs
two mappings, one to map from the source speaker to
an eigenvoice (or average voice) trained from reference
speakers, and another from the eigenvoice to the target
speaker. While these approaches don't require parallel training
data, they do require parallel data from some reference
speakers.
In speaker verification, the joint factor analysis method
[155] decomposes a supervector into speaker independent,
speaker dependent and channel dependent components,
each of which is represented by a low-dimensional set of
factors. This aims to disentangle speaker from other speech
content for effective speaker verification. Inspired by this
idea, we argue [156] that similar decomposition would be
useful in voice conversion, where we would like to separate
speaker information from the linguistic content, and apply
factor analysis on the speaker specific component.
With factor analysis, the speaker specific component
can be represented by a low-dimensional set of latent
variables via the factor loadings. One of the ideas [156] is
to estimate the phonetic component and factor loadings
from non-parallel prior data. In this way, during the training
process, we only estimate a low-dimensional set of speaker
identity factors and a tied covariance matrix instead of
a full conversion function from the source-target parallel
utterances. Even though parallel utterances are still required
for estimating the conversion function, the use of prior
data allows us to obtain a reliable model from much fewer
training samples than those required by conventional JD-
GMM [157].
Another idea is to perform the voice conversion in
i-vector [155] speaker space, where i-vector is used to
disentangle a speaker from the linguistic content. The
primary motivation is that an i-vector can be extracted in
an unsupervised manner regardless of speaker or speech
content, which opens up new possibilities especially for
non-parallel data scenarios where source and target speech
is of different content or even in different languages [28],
[45], [158]. Kinnunen et al. [159] studies a way to shift the
acoustic features of input speech towards target speech in
the i-vector space. The idea is to learn a function that maps
the i-vector of the source utterance to that of the target.
With the mapping function, we are able to convert the
source speech frame-by-frame to the target. This technique
is free of any parallel data, and text transcription.
V. DEEP LEARNING FOR VOICE CONVERSION
Voice conversion is typically a research problem with
scarce training data. Deep learning techniques are typi-
cally data-driven and rely on big data. However, this is
actually the strength of deep learning in voice conver-
sion. Deep learning opens up many possibilities to benefit
from abundantly available training data, so that the voice
conversion task can focus more on learning the mapping
of speaker characteristics. For example, it shouldnt be
the job of voice conversion task to infer low level detail
during speech reconstruction, a neural vocoder can learn
from large database to do so [98]. It shouldn’t be a task
of voice conversion to learn how to represent an entire
phonetic system of a spoken language, a general purpose
acoustic model of neural ASR [160] or TTS [161] system
can learn from a large database to do so. By leveraging
the large database, we free up the conversion network
from using its capacity to represent low level detail and
general information, but instead, to focus on the high level
semantics necessary for speaker identity conversion.
Deep learning techniques also transform the way we im-
plement the analysis-mapping-reconstruction pipeline. For
effective mapping, we need to derive an adequate intermediate
representation of speech, as discussed in Section II.
The concept of embedding in deep learning provides a
new way of deriving the intermediate representation, for
example, latent code for linguistic content, and speaker
embedding for speaker identity. It also makes the disen-
tanglement of speaker from content much easier.
In this section, we will summarize how deep learning
helps address existing research problems, such as parallel
and non-parallel data voice conversion. We will also review
how deep learning breaks new ground in voice conversion
research.
A. Deep Learning for Frame-Aligned Parallel Data
The study on deep learning approaches for voice con-
version started with parallel training data, where we use
a neural network as an improved regression function to
approximate the mapping function y = F(x) under the
frame-level mapping paradigm in Figure 2.
1) DNN Mapping Function: The early studies on DNN-
based voice conversion methods are focused on spectral
transformation. The DNN mapping function, y = F(x), has some
clear advantages over other statistical models, such as GMM
and DKPLS. For instance, it allows for non-linear mapping
between source and target features, and there is little
restriction on the dimension of the features to be modeled. We
note that conversion on other acoustic features, such as
fundamental frequency and energy contour, can also be
done similarly [162].
Desai et al. [81] proposed a DNN to map a low-
dimensional spectral representation, such as mel-cepstral
coefficients (MCEP), from source to target speaker.
Nakashika et al. [163] proposed to use Deep Belief Nets
(DBNs) to extract latent features from source and target
cepstrum coefficients, and use a neural network with one
hidden layer to perform conversion between latent features.
Mohammadi et al. [164] furthered the idea by studying
a deep autoencoder from multiple speakers to derive a
compact representation of speech spectral features. High-
dimensional representation of spectrum has also been used
in a more recent work [165] for spectral mapping, together
with dynamic features and a parameter generation algo-
rithm [166]. Chen et al. [167] proposed to model the distri-
butions of spectral envelopes of source and target speakers
respectively through a layer-wise generative training.
Generally speaking, DNN for spectrum and/or prosody
transformation requires a large amount of parallel training
data from paired speakers, which is not always feasible. But
it opens up opportunities for us to make use of speech data
from multiple speakers beyond source and target, to better
model the source and the target speakers, and to discover
better feature representations for feature mapping.
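A minimal PyTorch sketch of a frame-level DNN mapping function y = F(x) trained on frame-aligned source/target feature vectors. The feature dimension, layer sizes, learning rate and loss are illustrative assumptions, not the settings used in the cited studies.

import torch
import torch.nn as nn

class MappingDNN(nn.Module):
    """Feed-forward mapping between frame-aligned source and target feature vectors."""
    def __init__(self, dim=24, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

model = MappingDNN()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(x_batch, y_batch):
    """x_batch, y_batch: frame-aligned source/target feature tensors of shape (B, 24)."""
    optimiser.zero_grad()
    loss = loss_fn(model(x_batch), y_batch)
    loss.backward()
    optimiser.step()
    return loss.item()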
2) LSTM Mapping Function: To model the temporal correlation across speech frames in voice conversion, Nakashika et al. [168] explored the use of Recurrent Temporal Restricted Boltzmann Machines (RTRBM), a type of recurrent neural network. The success of Long Short-Term Memory (LSTM) [169], [170] in sequence-to-sequence modeling inspired the study of LSTM in voice conversion, which leads to an improvement in the naturalness and continuity of the speech output.
The LSTM network architecture consists of a set of memory blocks and peephole connections that support the storage of, and access to, long-range contextual information [171] in linear memory cells. It learns the optimal amount of contextual information for voice conversion. A bidirectional LSTM (BLSTM) network is expected to capture sequential information and maintain long-range contextual features from both the forward and backward sequences [45].
Sun et al. [40] and Ming et al. [172] proposed a deep bidirectional LSTM network (DBLSTM) that stacks multiple hidden layers of the BLSTM architecture, which is shown to outperform DNN voice conversion even without using dynamic features. While the DBLSTM-based voice conversion approach generates high-quality synthesized voice, it typically requires a large speech corpus from the source and target speakers for training, which limits the scope of its applications in practice [40].
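A minimal sketch of such a BLSTM mapping network is given below, again assuming PyTorch and illustrative layer sizes rather than the exact configurations of [40], [172]; the network maps a sequence of source frames to a sequence of target frames of the same length.

```python
# A minimal sketch of a (deep) bidirectional LSTM mapping network that
# converts a sequence of source MCEP frames to target MCEP frames.
# Layer sizes are illustrative assumptions, not from the cited papers.
import torch
import torch.nn as nn

class BLSTMMapper(nn.Module):
    def __init__(self, feat_dim=24, hidden_dim=256, num_layers=2):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, feat_dim)

    def forward(self, x):          # x: (batch, frames, feat_dim)
        h, _ = self.blstm(x)       # h: (batch, frames, 2 * hidden_dim)
        return self.proj(h)        # frame-by-frame target prediction

# The network is trained with an MSE loss on frame-aligned parallel data,
# just like the DNN mapper, but it can exploit context across frames.
model = BLSTMMapper()
x = torch.randn(4, 200, 24)        # 4 utterances of 200 aligned frames
y_hat = model(x)                   # output has the same length as the input
```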
Just like the GMM approach, the DNN and LSTM techniques rely on an external frame aligner during training data preparation, as illustrated in Figure 2. At run-time, the conversion process follows the typical 3-step pipeline and does not change the speech duration during the conversion.
B. Encoder-decoder with Attention for Parallel Data
The research problems of voice conversion are centered
around alignment and mapping, which are interrelated both
during training and at run-time inference, as illustrated in
Figure 2. During training, more accurate alignment helps build a better mapping function, which explains why we prefer parallel training data. At run-time inference, the frame-level mapping paradigm does not change the duration of the speech during the conversion. While it is possible to model and predict the duration of the voice conversion output, it is not straightforward to incorporate a duration model and a mapping model in a systematic manner. Deep learning provides a new solution to this research problem.

Fig. 6: Encoder-decoder mechanism with attention for voice conversion.
The attention mechanism [173], [174] in encoder-decoder neural networks brings about a paradigm change. The idea of attention was first successfully used in machine translation [173], speech recognition [175], and sequence-to-sequence speech synthesis [86], [176]–[178], which led to many parallel studies in voice conversion [179]–[181]. With the attention mechanism, the neural network learns the feature mapping and the alignment at the same time during training. At run-time inference, the network automatically decides the output duration according to what it has learnt. In other words, the frame aligner in Figure 2 is no longer required.
There are several variations based on recurrent neural networks, such as SCENT [179] and AttS2S-VC [181]. They follow the widely-used architecture of encoder-decoder with attention [180], [182]. Suppose that we have a source speech sequence $x = \{x_1, x_2, \ldots, x_{T_s}\}$. The encoder network first transforms the input feature sequence into hidden representations $h = \{h_1, h_2, \ldots, h_{T_h}\}$ at a lower frame rate, with $T_h < T_s$, which are suitable for the decoder to deal with. At each decoder time step, the attention module aggregates the encoder outputs by attention probabilities and produces a context vector. Then, the decoder predicts the output acoustic features frame by frame using the context vectors. Furthermore, a post-filtering network is designed to enhance the accuracy of the converted acoustic features to generate the converted speech $y = \{y_1, y_2, \ldots, y_{T_y}\}$. During training, the attention mechanism learns the mapping dynamics between the source sequence and the target sequence. At run-time inference, the decoder and the attention mechanism interact to perform the mapping and the alignment at the same time. The overall architecture is illustrated in Figure 6.
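The sketch below illustrates this decoding loop with a simple dot-product attention, assuming PyTorch; it is a minimal illustration rather than a faithful re-implementation of SCENT or AttS2S-VC, which additionally use, e.g., more elaborate attention, stop-token prediction and post-filtering.

```python
# A compact sketch of the encoder-decoder with attention in Figure 6.
# Dimensions and the simple dot-product attention are illustrative
# assumptions; the cited systems are considerably more elaborate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2SeqVC(nn.Module):
    def __init__(self, feat_dim=80, hidden_dim=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRUCell(feat_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, feat_dim)

    def forward(self, src, max_len):
        enc, _ = self.encoder(src)                 # (B, Ts, H)
        B, _, H = enc.shape
        state = enc.new_zeros(B, H)
        frame = src.new_zeros(B, src.size(-1))     # initial "go" frame
        outputs = []
        # For simplicity the output length is given; real systems usually
        # predict a stop token to decide when to terminate decoding.
        for _ in range(max_len):
            # dot-product attention over all encoder outputs
            scores = torch.bmm(enc, state.unsqueeze(-1)).squeeze(-1)   # (B, Ts)
            weights = F.softmax(scores, dim=-1)
            context = torch.bmm(weights.unsqueeze(1), enc).squeeze(1)  # (B, H)
            state = self.decoder(torch.cat([frame, context], dim=-1), state)
            frame = self.out(state)                # predicted acoustic frame
            outputs.append(frame)
        return torch.stack(outputs, dim=1)         # (B, Ty, feat_dim)

model = Seq2SeqVC()
src = torch.randn(2, 120, 80)                      # two source utterances
converted = model(src, max_len=100)                # Ty need not equal Ts
```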
While recurrent neural networks represent an effective implementation of sequence-to-sequence conversion, recent studies have shown that convolutional neural networks with gating mechanisms also learn long-term dependencies well [53], [183]. Such a model employs an attention mechanism that effectively makes parallel computation possible for encoding and decoding. During decoding, the causal convolution design allows the model to generate an output sequence in an autoregressive manner. Kameoka et al. proposed a convolutional neural network implementation for voice conversion [184], called ConvS2S-VC. Recent studies show that ConvS2S-VC outperforms its recurrent
neural network counterparts in both pairwise and many-to-many voice conversion [181].

Fig. 7: Training a CycleGAN with a cycle-consistency loss based on the L1 norm for voice conversion with non-parallel training data of paired speakers. The L1 norm represents the least absolute errors.
The encoder-decoder structure with attention marks a departure from the frame-level mapping paradigm. The attention does not perform the mapping frame-by-frame, but rather allows the decoder to attend to multiple speech frames and use a soft combination of them to predict an output frame in the decoding process. With the attention mechanism, the duration of the converted speech $T_y$ is typically different from that of the source speech $T_s$, reflecting the differences in speaking style between source and target. This represents a way to handle both spectral and prosody conversion at the same time. Studies have attributed the improvement of voice quality to the effective attention mechanism. The attention mechanism also represents the first step towards relaxing the rigid requirement of parallel data in voice conversion.
C. Beyond Parallel Data of Paired Speakers
In Sections III and IV, we studied statistical modeling for voice conversion with parallel and with non-parallel training data. The advent of deep learning has broken new ground for voice conversion research, and we now go beyond the paradigm of parallel and non-parallel training data. By non-parallel training data, we refer to the case where non-parallel utterances from the source and target speakers are still required. However, recent studies show that deep learning has enabled many voice conversion scenarios without the need for parallel data. In this section, we summarize the studies into four scenarios,
1) Non-parallel data of paired speakers,
2) Leveraging TTS systems,
3) Leveraging ASR systems, and
4) Disentangling speaker from linguistic content.
1) Non-parallel data of paired speakers: Voice conversion with non-parallel training data is a task similar to image-to-image translation, which is to find a mapping from a source domain to a target domain without the need for parallel training data. Let us draw a parallel between image-to-image translation and voice conversion. In image translation, we would like to translate a horse into a zebra, where we preserve the structure of the horse and change its coat to that of a zebra [185]–[190]; in voice conversion, we would like to transform one voice into that of another, while preserving the linguistic and prosodic content.
CycleGAN is based on the concept of adversarial learning [191], which is to train a generative model to find a solution in a min-max game between two neural networks, called the generator (G) and the discriminator (D). It is known to achieve remarkable results [185] on several tasks where paired training data do not exist, such as image manipulation and synthesis [185], [188], [192]–[195], speech enhancement [196], speech recognition [197], and speech synthesis [198], [199].
As the speech data are non-parallel, alignment is not easily achieved. Kaneko and Kameoka first studied a CycleGAN [47], [48], [200], [201] that incorporates three loss functions, the adversarial loss, the cycle-consistency loss, and the identity-mapping loss, to learn forward and inverse mappings between the source and target speakers.
The adversarial loss measures how distinguishable the distribution of the converted features is from that of the real features, i.e., the source features $x$ or the target features $y$. For the forward mapping, it is defined as follows:

$\mathcal{L}_{ADV}(G_{X \to Y}, D_Y, X, Y) = \mathbb{E}_{y \sim P(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim P(x)}[\log(1 - D_Y(G_{X \to Y}(x)))]$ (12)

The closer the distribution of the converted data is to that of the target data, the smaller this loss becomes.
The adversarial loss only tells us whether $G_{X \to Y}$ follows the distribution of the target data; it does not ensure that the contextual information, i.e., the general sentence structure that we would like to carry over from source to target, is preserved. To ensure that we maintain consistent contextual information between $x$ and $G_{X \to Y}(x)$, the cycle-consistency loss, presented in Figure 7, is introduced,
$\mathcal{L}_{CYC}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x \sim P(x)}[\|G_{Y \to X}(G_{X \to Y}(x)) - x\|_1] + \mathbb{E}_{y \sim P(y)}[\|G_{X \to Y}(G_{Y \to X}(y)) - y\|_1]$ (13)

where $\|\cdot\|_1$ denotes the L1 norm, or least absolute errors, which is known to produce sharper spectral features. This loss encourages $G_{X \to Y}$ and $G_{Y \to X}$ to find an optimal pseudo pair of $(x, y)$ through circular conversion.
To encourage the generator to find a mapping that preserves the underlying linguistic content between the input and output [202], an identity-mapping loss is introduced as follows,

$\mathcal{L}_{ID}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x \sim P(x)}[\|G_{Y \to X}(x) - x\|] + \mathbb{E}_{y \sim P(y)}[\|G_{X \to Y}(y) - y\|]$ (14)
Combining the three loss functions, we have the total loss,

$\mathcal{L}(G, F, D_X, D_Y, X, Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, X, Y) + \lambda_{CYC}\,\mathcal{L}_{CYC}(G, F, X, Y) + \lambda_{ID}\,\mathcal{L}_{ID}(G, F, X, Y)$ (15)

where $G = G_{X \to Y}$ and $F = G_{Y \to X}$ denote the forward and inverse generators, and $\lambda_{CYC}$ and $\lambda_{ID}$ are trade-off parameters.
The optimal mapping functions $G^*$ and $F^*$ are obtained by solving the min-max game defined as:

$G^*, F^* = \arg\min_{G,F} \max_{D_X, D_Y} \mathcal{L}(G, F, D_X, D_Y, X, Y)$ (16)
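As an illustration of how Eqs. (12)-(15) translate into a training objective, the sketch below computes the generator and discriminator losses in PyTorch; the binary cross-entropy form of the adversarial loss, the placeholder generators and discriminators, and the trade-off weights are assumptions (practical CycleGAN-VC implementations often use a least-squares adversarial loss instead).

```python
# A minimal sketch of the CycleGAN-VC training objective in Eqs. (12)-(15).
# G_xy, G_yx, D_x, D_y stand for the two generators and discriminators; any
# nn.Module with matching input/output shapes could be plugged in.
import torch
import torch.nn.functional as F

def generator_losses(G_xy, G_yx, D_x, D_y, x, y, lambda_cyc=10.0, lambda_id=5.0):
    fake_y, fake_x = G_xy(x), G_yx(y)
    # adversarial terms, Eq. (12): generators try to make the discriminators
    # label the converted features as real
    dy_fake, dx_fake = D_y(fake_y), D_x(fake_x)
    adv = F.binary_cross_entropy_with_logits(dy_fake, torch.ones_like(dy_fake)) \
        + F.binary_cross_entropy_with_logits(dx_fake, torch.ones_like(dx_fake))
    # cycle-consistency loss, Eq. (13): x -> y -> x and y -> x -> y
    cyc = F.l1_loss(G_yx(fake_y), x) + F.l1_loss(G_xy(fake_x), y)
    # identity-mapping loss, Eq. (14)
    idt = F.l1_loss(G_yx(x), x) + F.l1_loss(G_xy(y), y)
    # total generator objective, Eq. (15)
    return adv + lambda_cyc * cyc + lambda_id * idt

def discriminator_loss(D, real, fake):
    # the discriminator learns to separate real from converted features
    real_logits, fake_logits = D(real), D(fake.detach())
    return F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
         + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))

# Toy instantiation with linear generators/discriminators over 24-dim MCEPs:
G_xy, G_yx = torch.nn.Linear(24, 24), torch.nn.Linear(24, 24)
D_x, D_y = torch.nn.Linear(24, 1), torch.nn.Linear(24, 1)
x, y = torch.randn(8, 24), torch.randn(8, 24)
loss_g = generator_losses(G_xy, G_yx, D_x, D_y, x, y)
loss_d = discriminator_loss(D_y, y, G_xy(x))
```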
CycleGAN represents a successful deep learning implementation that finds an optimal pseudo pair from non-parallel data of paired speakers. It does not require any frame alignment mechanism such as dynamic time warping or attention. Experimental results show that, with non-parallel training data, CycleGAN achieves performance comparable to that of a GMM-based system trained on twice the amount of parallel data [47]. Moreover, with adversarial training, it effectively overcomes the over-smoothing problem, which is known to be one of the main factors leading to speech-quality degradation. We note that, more recently, CycleGAN-VC2, an improved version of CycleGAN-VC, has been studied [201]; it further improves CycleGAN by incorporating three new techniques: an improved objective (two-step adversarial losses), an improved generator (2-1-2D CNN), and an improved discriminator (PatchGAN). CycleGAN has been successfully applied to mono-lingual [48], [203], cross-lingual voice conversion [204], emotional voice conversion [205], [206], and rhythm-flexible voice conversion [207].
Unlike the encoder-decoder structure, CycleGAN follows a generative modeling architecture that does not explicitly model internal representations, such as voice identity, speech duration, and emotion, to support flexible manipulation. Therefore, it is more suitable for voice conversion between a specific source and target pair. Nonetheless, it represents an important milestone towards non-parallel data voice conversion.
2) Leveraging TTS systems: We have discussed deep learning architectures for voice conversion that do not involve text. One of the important aspects of voice conversion is to carry forward the linguistic content from source to target. Voice conversion and TTS systems are similar in the sense that they both aim to generate high-quality speech with the appropriate linguistic content. A TTS system provides a mechanism for the speech to adhere to the linguistic content. The ideas for leveraging the TTS mechanism can be motivated in different ways: firstly, a TTS system is trained on a large speech database, which offers a high-quality speech reconstruction mechanism given the linguistic content;
secondly, a TTS system is equipped with a high-quality attention mechanism that is needed by voice conversion.

Fig. 8: The upper panel shows a TTS flow, and the lower panel shows a voice conversion flow. Both follow a similar encoder-decoder with attention architecture. Voice conversion leverages the TTS system that is linguistically informed.
As illustrated in Figure 8, encoder-decoder models with attention have recently shown considerable success in modeling a variety of complex sequence-to-sequence problems. Tacotron [87], [176], [208] represents one of the successful text-to-speech (TTS) implementations, and it has been extended to voice conversion [3], [179].
Zhang et al. proposed a joint training system architecture for both text-to-speech and voice conversion [3] by extending the model architecture of Tacotron; it features a multi-source sequence-to-sequence model with dual inputs and a dual attention mechanism. By taking only text as input, the system performs speech synthesis. The system can also take either voice alone, or both text and voice, as input for voice conversion. The multi-source encoder-decoder model is trained with a decoder that is linguistically informed via the joint TTS training, as illustrated by the shared decoder in Figure 8. Experiments show that the joint training improves the voice conversion task with or without text input at run-time inference.
Park et al. proposed a voice conversion system, known as Cotatron, that is built on top of a multi-speaker Tacotron TTS architecture [161]. At run-time inference, the pre-trained TTS system is used to derive speaker-independent linguistic features of the source speech. This process is guided by the transcription of the input speech; as such, the text transcription of the source speech is required at run-time inference. The system uses the TTS encoder to extract speaker-independent linguistic features, in other words, to disentangle the speaker identity. The decoder then takes the attention-aligned speaker-independent linguistic features as the input, and the target speaker identity as the condition, to generate the target speaker's voice. In this way, voice conversion leverages the attention mechanism, or shared attention, from TTS, as shown in Figure 8. Cotatron is designed to perform one-to-many voice conversion. A study [209] that shares a similar motivation with [161], but is based on
the Transformer instead of Tacotron, suggests transferring knowledge from a learned TTS model to benefit from large-scale, easily accessible TTS corpora.

Fig. 9: Training phase of the average modeling approach that maps PPG features to MCEP features for voice conversion [44].
Zhang et al. [210] proposed to improve the sequence-to-sequence model [179] by using text supervision during training. A multi-task learning structure is designed, which adds auxiliary classifiers to the middle layers of the sequence-to-sequence model to predict linguistic labels as a secondary task. The linguistic labels can be obtained either manually or automatically with alignment tools. With the linguistic label objective, the encoder and decoder are expected to generate meaningful intermediate representations that are linguistically informed. The text transcripts are only required during training. Experiments show that multi-task learning with linguistic labels effectively improves the alignment quality of the model, thus alleviating issues such as mispronunciation.
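A minimal sketch of such a multi-task objective is given below, assuming PyTorch; the classifier, the label inventory size and the loss weight are illustrative assumptions rather than the configuration of [210].

```python
# A minimal sketch of multi-task training with an auxiliary linguistic
# classifier attached to an intermediate representation, as in text-supervised
# sequence-to-sequence VC. Dimensions and the loss weight are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxiliaryClassifier(nn.Module):
    def __init__(self, hidden_dim=256, num_phones=70):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_phones)

    def forward(self, hidden):            # hidden: (B, T, hidden_dim)
        return self.proj(hidden)          # per-frame phone logits

def multitask_loss(acoustic_pred, acoustic_target, phone_logits, phone_labels,
                   weight=0.1):
    # main task: acoustic feature regression; secondary task: phone prediction
    main = F.l1_loss(acoustic_pred, acoustic_target)
    aux = F.cross_entropy(phone_logits.transpose(1, 2), phone_labels)
    return main + weight * aux

# Toy usage with stand-in tensors (2 utterances, 50 frames):
logits = AuxiliaryClassifier()(torch.randn(2, 50, 256))
loss = multitask_loss(torch.randn(2, 50, 80), torch.randn(2, 50, 80),
                      logits, torch.randint(0, 70, (2, 50)))
```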
The neural representation of deep learning has facilitated the interaction between TTS and voice conversion. By leveraging TTS systems, we hope to improve the training and run-time inference of voice conversion by adhering to the linguistic content. However, such techniques usually require a large training corpus. Recent studies introduced frameworks for creating limited-data VC systems [209], [211], [212] by bootstrapping from a speaker-adaptive TTS model. How voice conversion can benefit from TTS systems without involving large training data deserves further study.
3) Leveraging ASR systems: Deep learning approaches for voice conversion typically require a large parallel corpus for training. This is partly because we would like to learn latent representations that describe the phonetic system. The requirement for training data has limited the scope of potential applications. We know that most ASR systems are already trained with a large corpus, and they already describe the phonetic system well in different ways. The question is how to leverage the latent representations in ASR systems for voice conversion.
One of the ideas is to use the context posterior probability sequence produced by an ASR model with sequence-to-sequence learning to generate a target speech feature sequence [160]. In this model, the system has an encoder-decoder structure similar to Figure 6, except that it uses a speech recognizer as the encoder and a speech synthesizer as the decoder. Another study guides a sequence-to-sequence voice conversion model with an ASR system, which augments the inputs with bottleneck features [179]. Recently, an end-to-end speech-to-speech sequence transducer, Parrotron [213], was studied. Parrotron learns to convert the speech spectrogram of any speaker, with multiple accents and imperfections, to the voice of a single predefined target speaker. Parrotron accomplishes this by using an auxiliary ASR decoder to predict the transcript of the output speech, conditioned on the encoder latent representation. The multi-task training of Parrotron optimizes the decoder to generate the target voice and, at the same time, constrains the latent representation to retain linguistic information only. The ASR decoder aims to disentangle the speaker's identity from the speech. The above techniques adopt the encoder-decoder with attention architecture.
Another way to look at voice conversion is that speech consists of two components, a speaker-dependent component and a speaker-independent component. If we are able to decompose speech signals into these two components, we can carry over the speaker-independent component and convert only the speaker-dependent component to achieve voice conversion. The average modeling technique represents one of the successful implementations [41], where we build a mapping function to convert phonetic posteriorgrams (PPGs) [32] to acoustic features. The PPG features are derived from an ASR system and can be considered speaker independent. We train the mapping function on multi-speaker, non-parallel speech data. In this way, one does not need to train a full conversion model for each target speaker. The average model can be adapted towards the target with a small amount of target speech. The training and adaptation of the average model are illustrated in Figure 9.
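The sketch below illustrates the core mapping of the average modeling approach, assuming PyTorch; the PPG dimension, the network and the adaptation recipe are illustrative assumptions rather than the exact setup of [41], [44].

```python
# A minimal sketch of the average modeling idea: a network maps
# speaker-independent PPG features (from an ASR system) to acoustic features
# and is trained on multi-speaker, non-parallel data. The PPG dimension,
# network size and fine-tuning recipe are illustrative assumptions.
import torch
import torch.nn as nn

class PPGToAcoustic(nn.Module):
    def __init__(self, ppg_dim=144, feat_dim=24, hidden_dim=256):
        super().__init__()
        self.blstm = nn.LSTM(ppg_dim, hidden_dim, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, feat_dim)

    def forward(self, ppg):                 # ppg: (B, T, ppg_dim)
        h, _ = self.blstm(ppg)
        return self.proj(h)                 # predicted acoustic features

# Step 1: train the average model on PPG/MCEP pairs pooled over many speakers.
# Step 2: adapt it to a new target speaker with a small amount of target
#         speech, e.g., by fine-tuning with a smaller learning rate.
model = PPGToAcoustic()
adapt_optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```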
There were several follow-up studies along this direction; for example, Tian et al. propose a PPG-to-waveform conversion [94], and an average model with speaker identity [155] as a condition [44]. Zhou et al. propose to use PPGs as the linguistic features for cross-lingual voice conversion [158]. Liu et al. propose to use PPGs for emotional voice conversion [214]. Zhang et al. also show that the average model framework can benefit from a small amount of parallel training data using an error reduction network [215].
4) Disentangling speaker from linguistic content: In the context of voice conversion, speech can be considered as a composition of speaker voice identity and linguistic content. If we are able to disentangle the speaker from the linguistic content, we can change the speaker identity independently of the linguistic content. The auto-encoder [216] represents one of the common techniques for speech disentanglement and reconstruction. There are other techniques, such as instance normalization [217] and vector quantization [218], [219], that are effective in disentangling the speaker from the content.
Fig. 10: A typical auto-encoding network for voice conversion, where the encoders and the decoder learn to disentangle speaker from linguistic content. At run-time, the latent code representing the linguistic content of the source speech is combined with the speaker embedding of a target speaker to generate the target speech.

An auto-encoder learns to reproduce its input as its output. Therefore, parallel training data are not required. An
encoder learns to represent the input with a latent code, and a decoder learns to reconstruct the original input from the latent code. The latent code can be seen as an information bottleneck which, on one hand, lets through the information necessary for reconstruction, e.g., the speaker-independent linguistic content, and, on the other hand, forces other information, e.g., speaker, noise and channel information, to be discarded [83]. The variational auto-encoder (VAE) [220] is the stochastic version of the auto-encoder, in which the encoder produces distributions over latent representations rather than deterministic latent codes, while the decoder is trained on samples from these distributions. The variational auto-encoder is more suitable than the deterministic auto-encoder for synthesizing new samples.
Chorowski et al. [98] provide a comparison of three auto-encoding neural networks by studying how they learn a representation from speech data that separates the speaker identity from the linguistic content. It was shown that the discrete representation, i.e., the latent code obtained from VQ-VAE, preserves the most linguistic content while also being the most speaker-invariant. Recently, a group latent embedding technique for VQ-VAE was studied to improve the encoding process; it divides the embedding dictionary into groups and uses the weighted average of the atoms in the nearest group as the latent embedding [221].
The concept of a VAE-based voice conversion framework [43] is illustrated in Figure 10. The decoder reconstructs the utterance by conditioning on the latent code extracted by the encoder, and separately on a speaker code, which could be a one-hot vector [43], [222] for a closed set of speakers, or an i-vector [155], a bottleneck speaker representation [223], or a d-vector [224] for an open set of speakers. By explicitly conditioning the decoder on the speaker identity, the encoder is forced to capture speaker-independent information in the latent code from a multi-speaker database.
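The sketch below illustrates this conditioning scheme with a plain (non-variational) auto-encoder and a speaker lookup table, assuming PyTorch; all dimensions are illustrative, and a VAE would additionally predict a mean and variance for the latent code and add a KL-divergence term to the reconstruction loss.

```python
# A minimal sketch of the auto-encoding framework in Figure 10: a content
# encoder produces a latent code, and the decoder is conditioned on a speaker
# embedding (here a simple lookup table for a closed speaker set). All
# dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class AutoEncoderVC(nn.Module):
    def __init__(self, feat_dim=80, latent_dim=64, num_speakers=10, spk_dim=32):
        super().__init__()
        self.content_encoder = nn.GRU(feat_dim, latent_dim, batch_first=True)
        self.speaker_table = nn.Embedding(num_speakers, spk_dim)
        self.decoder = nn.GRU(latent_dim + spk_dim, feat_dim, batch_first=True)

    def forward(self, x, speaker_id):
        code, _ = self.content_encoder(x)                    # (B, T, latent)
        spk = self.speaker_table(speaker_id)                 # (B, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, code.size(1), -1)  # broadcast over T
        out, _ = self.decoder(torch.cat([code, spk], dim=-1))
        return out

# Training reconstructs each speaker's own speech; at run time the source
# latent code is paired with the target speaker's embedding for conversion.
model = AutoEncoderVC()
x = torch.randn(2, 100, 80)
converted = model(x, speaker_id=torch.tensor([3, 3]))
```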
Just like other auto-encoders, the VAE decoder tends to generate over-smoothed speech. This can be problematic for voice conversion because the network may generate poor-quality, buzzy-sounding speech. Generative adversarial networks (GANs) [225] were proposed as one of the solutions to the over-smoothing problem. GANs offer a general framework for training a data generator in such a way that it can deceive a real/fake discriminator that attempts to distinguish real data from fake data produced by the generator. By incorporating the GAN concept into the VAE, VAE-GAN was studied for voice conversion with non-parallel training data [46] and for cross-lingual voice conversion [204]. It was shown that VAE-GAN [225] produces more natural-sounding speech than the standard VAE method [43], [223].
A recent study on sequence-to-sequence non-parallel
voice conversion [226] shows that it is possible to explicitly
model the transfer of other aspects of speech, such as
source rhythm, speaking style, and emotion to the target
speech.
VI. EVALUATION OF VOICE CONVERSION
Effective assessment of voice quality is required to validate algorithms, to measure technological progress, and to benchmark a system against the state-of-the-art. Typically, we report results in terms of objective and subjective measurements.

To provide an objective evaluation, a reference speech is required. Common objective evaluation metrics include Mel-cepstral distortion (MCD) [227] for the spectrum, and PCC [228] and RMSE [229]–[231] for prosody. We note that such metrics are not always correlated with human perception, partly because they measure the distortion of acoustic features rather than the waveform that humans actually listen to.
Subjective evaluation metrics, such as the mean opinion score (MOS) [2], [232]–[234], preference tests [18], [235], and best-worst scaling [236], can represent the intrinsic naturalness and the similarity to the target. We note that, for a subjective evaluation to be meaningful, a large number of listeners is required, which is not always possible in practice.
A. Objective Evaluation
1) Spectrum Conversion: To provide an objective evaluation, first of all, we need a reference utterance spoken by the target speaker. Ideally, the converted speech is very close to the reference speech. We can measure the differences between them by comparing their spectral distances. However, there is no guarantee that the converted speech and the reference speech are of the same length. In this case, a frame aligner is required to establish the frame-level mapping. Mel-cepstral distortion (MCD) [227] is commonly used to measure the difference between two spectral features [62], [237]–[239]. It is calculated between the converted and
target Mel-cepstral coefficients (MCEPs) [240], [241], $\hat{y}$ and $y$. Suppose that each MCEP vector consists of 24 coefficients; we have $\hat{y} = \{mc^{c}_{k,i}\}$ and $y = \{mc^{t}_{k,i}\}$ at frame $k$, where $i$ denotes the $i$-th coefficient of the converted and target MCEPs.

$MCD\,[dB] = \frac{10}{\ln 10}\sqrt{2\sum_{i=1}^{24}\left(mc^{t}_{k,i} - mc^{c}_{k,i}\right)^{2}}$ (17)
We note that a lower MCD indicates better performance. However, the MCD value is not always correlated with human perception. Therefore, subjective evaluations, such as MOS and similarity scores, are also conducted.
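A direct implementation of Eq. (17) is straightforward; the sketch below (Python/NumPy) computes the per-frame MCD and reports its average over frames, which is the usual way the single-frame definition above is aggregated over an utterance.

```python
# Eq. (17) applied to a pair of frame-aligned MCEP sequences
# (24 coefficients per frame), averaged over all frames.
import numpy as np

def mel_cepstral_distortion(mcep_converted, mcep_target):
    """mcep_*: arrays of shape (num_frames, 24), frame-aligned."""
    diff = mcep_target - mcep_converted
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))      # average MCD in dB over all frames

# Example with random stand-in features:
mcd = mel_cepstral_distortion(np.random.randn(200, 24), np.random.randn(200, 24))
```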
2) Prosody Conversion: The speech prosody of an utterance is characterized by its phonetic duration, energy contour, and pitch contour. To effectively measure how close the prosody patterns of the converted speech are to those of the reference speech, we need to provide measurements for all three aspects.

The alignment between the converted speech and the reference speech provides information about how much the phonetic durations differ from one another. We can derive the average number of frames that deviate from the ideal diagonal path, such as the frame disturbance [242], to report the differences in phonetic duration.
The Pearson Correlation Coefficient (PCC) [62], [205] and the Root Mean Squared Error (RMSE) have been widely used as evaluation metrics for the prosody contours or energy contours of two speech utterances; PCC measures their linear dependence, while RMSE measures their deviation.
We next take the measurement of two prosody contours as an example. The PCC between the aligned pair of converted and target F0 sequences is given as follows,

$\rho(F0^{c}, F0^{t}) = \frac{\mathrm{cov}(F0^{c}, F0^{t})}{\sigma_{F0^{c}}\,\sigma_{F0^{t}}}$ (18)

where $\sigma_{F0^{c}}$ and $\sigma_{F0^{t}}$ are the standard deviations of the converted F0 sequence ($F0^{c}$) and the target F0 sequence ($F0^{t}$), respectively. We note that a higher PCC value represents better F0 conversion performance.
The RMSE between the converted F0 and the corresponding target F0 is defined as,

$RMSE = \sqrt{\frac{1}{K}\sum_{k=1}^{K}\left(F0^{c}_{k} - F0^{t}_{k}\right)^{2}}$ (19)

where $F0^{c}_{k}$ and $F0^{t}_{k}$ denote the converted and target F0 features, respectively, and $K$ is the length of the F0 sequence, or the total number of frames. We note that a lower RMSE value represents better F0 conversion performance. The same measurement applies to energy contours as well.
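Both metrics are simple to compute once the two contours are time-aligned; the sketch below (Python/NumPy) implements Eqs. (18) and (19). In practice, they are often restricted to voiced frames.

```python
# PCC (Eq. 18) and RMSE (Eq. 19) between two time-aligned F0 contours.
import numpy as np

def f0_pcc(f0_converted, f0_target):
    # Pearson correlation between aligned F0 sequences
    return float(np.corrcoef(f0_converted, f0_target)[0, 1])

def f0_rmse(f0_converted, f0_target):
    return float(np.sqrt(np.mean((f0_converted - f0_target) ** 2)))

# Example with stand-in contours (in Hz):
f0_c = np.abs(np.random.randn(300)) * 20 + 150
f0_t = np.abs(np.random.randn(300)) * 20 + 160
print(f0_pcc(f0_c, f0_t), f0_rmse(f0_c, f0_t))
```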
Other generally-accepted metrics for prosody transfer include the F0 Frame Error (FFE) [243] and the Gross Pitch Error (GPE) [244]. We note that GPE reports the percentage of voiced frames whose pitch values differ by more than 20% from the reference, while FFE reports the percentage of frames that contain either a 20% pitch error or a voicing decision error [245].
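The sketch below (Python/NumPy) implements GPE and FFE as described above from aligned F0 contours and binary voicing decisions; treating GPE as a fraction of the frames voiced in both contours follows the common convention and is an assumption here.

```python
# GPE and FFE computed from aligned F0 contours and voicing decisions
# (1 = voiced, 0 = unvoiced), with a 20% relative pitch error threshold.
import numpy as np

def gross_pitch_error(f0_converted, f0_target, voiced_c, voiced_t):
    both_voiced = (voiced_c == 1) & (voiced_t == 1)
    pitch_err = np.abs(f0_converted - f0_target) > 0.2 * f0_target
    return float(np.sum(both_voiced & pitch_err) / max(np.sum(both_voiced), 1))

def f0_frame_error(f0_converted, f0_target, voiced_c, voiced_t):
    voicing_err = voiced_c != voiced_t
    both_voiced = (voiced_c == 1) & (voiced_t == 1)
    pitch_err = both_voiced & (np.abs(f0_converted - f0_target) > 0.2 * f0_target)
    return float(np.sum(voicing_err | pitch_err) / len(f0_target))

# Example with a short stand-in contour:
f0_t = np.array([100., 110., 0., 120.]); f0_c = np.array([102., 140., 0., 118.])
v_t = np.array([1, 1, 0, 1]); v_c = np.array([1, 1, 1, 1])
print(gross_pitch_error(f0_c, f0_t, v_c, v_t), f0_frame_error(f0_c, f0_t, v_c, v_t))
```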
B. Subjective Evaluation
Mean Opinion Score (MOS) has been widely used in listening tests [40], [61], [62], [246]–[251]. In MOS experiments, listeners rate the quality of the converted voice using a 5-point scale: “5” for excellent, “4” for good, “3” for fair, “2” for poor, and “1” for bad. There are several evaluation methods that are similar to MOS, for example: 1) DMOS [252]–[254], a “degradation” or “differential” MOS test, which requires listeners to rate a sample with respect to a reference, and 2) MUSHRA [255]–[257], which stands for MUltiple Stimuli with Hidden Reference and Anchor, and requires fewer participants than MOS to obtain statistically significant results.
Another popular subjective evaluation is the preference test, also denoted as the AB/ABX test [2], [11], [40], [258]. In AB tests, listeners are presented with two speech samples and asked to indicate which one has more of a certain property, for example naturalness or similarity. In an ABX test, similar to AB, two samples are given together with an extra reference sample X. Listeners need to judge whether A or B is more like X in terms of naturalness, similarity, or even emotional quality [205]. We note that it is not practical to use AB and/or ABX tests for the comparison of many VC systems at the same time. MUSHRA is another type of voice quality test used in telecommunication [259], where the reference natural speech and several converted samples of the same content are presented to the listeners in a random order. The listeners are asked to rate the speech quality of each sample between 0 and 100.
It is known that people are good at picking the extremes, but their preferences for anything in between might be fuzzy and inaccurate when presented with a long list of options. Best-Worst Scaling (BWS) [236] has been proposed for voice conversion quality assessment [22], where listeners are presented with only a few randomly selected options each time. With many such BWS decisions, Best-Worst Scaling can handle a long list of options and generates more discriminating results, such as a voice quality ranking, than MOS and preference tests.
We note that subjective measures can represent the intrinsic naturalness and similarity of a voice conversion system. However, such evaluations can be time-consuming and expensive as they involve a large number of listeners.
C. Evaluation with Deep Learning Approaches
The study of perceptual quality evaluation seeks to approximate human judgement with computational models of psychoacoustic motivation. It provides insights into how humans perceive speech quality in listening tests, and suggests assessment metrics that are required in speech communication, speech enhancement, speech synthesis, voice conversion, and any other speech production or transmission application. Perceptual Evaluation of Speech Quality (PESQ) [260] is an ITU-T recommendation that is widely used as an industry standard. It provides an objective speech quality evaluation that predicts the human-perceived speech quality.
However, the PESQ formulation requires the presence of reference speech, which considerably restricts its use in voice conversion applications and motivates the study of perceptual evaluations that do not need reference speech. Metrics that do not require reference speech are called non-intrusive evaluation metrics. For example, Fu et al. [261] propose Quality-Net, an end-to-end model that predicts PESQ ratings, which serve as a proxy for human ratings. Yoshimura et al. [262] and Patton et al. [263] propose CNN-based naturalness predictors that predict human MOS ratings, among other non-intrusive assessment metrics [264]–[266].
Lo et al. [267] propose MOSNet, another non-intrusive assessment technique based on deep neural networks, which learns to predict human MOS ratings. MOSNet scores are highly correlated with human MOS ratings at the system level, and fairly correlated at the utterance level. While it is a non-intrusive evaluation metric for naturalness, MOSNet can also be modified and re-purposed to predict the similarity scores between target speech and converted speech. It provides similarity scores with fair correlation to human ratings on the VCC 2018 dataset. MOSNet, which is free and open-source, marks a recent advancement towards automatic perceptual quality evaluation [268].
VII. VOICE CONVERSION CHALLENGES
In this section, we give an overview of the series of voice conversion challenges, which provide shared tasks with common data sets and evaluation metrics for a fair comparison of algorithms. The voice conversion challenge (VCC) has been held every two years since 2016. In each challenge, a common database is provided by the organizers. The participants build voice conversion systems using their own technology, and the organizers evaluate the performance of the converted speech. The main evaluation methodology is a listening test in which crowd-sourced evaluators rate the naturalness and speaker similarity.
The 2016 challenge offered a standard voice conversion task using a parallel training database [269]. The 2018 challenge featured a more advanced conversion scenario using a non-parallel database [270]. The 2020 challenge puts forward a cross-lingual voice conversion research problem. A summary of VCC 2016, VCC 2018 and VCC 2020 is provided in Table I.
A. Why is the Challenge Needed?
As described earlier, many voice conversion approaches are data-driven, hence speech data are required to train the models and to evaluate the conversion. To compare such data-driven methods with each other precisely, a common database that explicitly specifies the training and evaluation data is needed. However, such a common database did not exist until 2016. Without common databases, researchers had to re-implement others' systems on their own databases before trying any new ideas. In such a situation, it is not guaranteed that the re-implemented system achieves the performance expected from the original work.
To address the same problem, the TTS community gave birth to the first Blizzard Challenge in 2005. Since then, the challenge has defined various standard databases for TTS and has made comparisons of TTS systems much fairer and easier. The motivations of the VCC are exactly the same as those of the Blizzard Challenges. The VCC introduced several standard databases for voice conversion and also defined common training and evaluation protocols. All the converted speech submitted by the participants of the challenges has been released publicly. In this way, researchers can compare the performance of their voice conversion system with that of other state-of-the-art systems without the need for re-implementation.
Another need for standard voice conversion databases arose from the biometric speaker recognition community. As voice conversion technology could be misused for attacking speaker verification systems, anti-spoofing countermeasures are required [271]. This is also called presentation attack detection. Anti-spoofing techniques aim at discriminating between fake, artificial inputs presented to biometric authentication systems and genuine inputs. If sufficient knowledge and data regarding the spoofed data are available, a binary classifier can be constructed to reject artificial inputs. Therefore, the common VCC databases are also important for anti-spoofing research. With a large amount of converted speech data from advanced voice conversion systems, researchers in the biometric community can develop anti-spoofing models to strengthen the defence of speaker recognition systems, and to evaluate their vulnerabilities.
B. Overview of the 2016 Voice Conversion Challenge
We first give an overview of the 2016 voice conversion challenge [269] and its datasets¹. As the first shared task in voice conversion, a parallel voice conversion task and its evaluation protocol were defined for VCC 2016. The parallel dataset consists of 162 common sentences uttered by both the source and target speakers. The target and source speakers are each four native speakers of American English (two females and two males). In the challenge, the participants develop conversion systems and produce converted speech for all possible source-target pair combinations. In total, eight speakers (plus two unused speakers) are included in the VCC 2016 database. The number of test sentences for evaluation is 54.
The main evaluation methodology adopted for the ranking is a subjective evaluation of the perceived naturalness and the speaker similarity of the converted samples to the target speakers. Naturalness is evaluated using the standard five-point mean opinion score (MOS) test, ranging from 1 (completely unnatural) to 5 (completely natural). Speaker similarity is evaluated using the Same/Different paradigm [272]. Subjects are asked to listen to two audio samples and to judge whether they are speech signals produced by the same speaker on a four-point scale: “Same, absolutely sure”, “Same, not sure”, “Different, not sure” and “Different, absolutely sure”. As the perceived speaker similarity to a target speaker and the perceived voice quality are not necessarily correlated, it is important to use a scatter plot to observe the trade-off between the two aspects.
¹ The VCC2016 dataset is available at https://doi.org/10.7488/ds/1575
Challenge | Language | Task | Training Data | # Speakers | Testing Data
VCC 2016 | monolingual | parallel | 162 paired utterances | 4 source, 4 target | 54 utterances
VCC 2018 | monolingual | parallel | 81 paired utterances | 4 source, 4 target | 35 utterances
VCC 2018 | monolingual | nonparallel | 81 unpaired utterances | 4 source, 4 target | 35 utterances
VCC 2020 | monolingual | parallel + nonparallel | 20 paired, 50 unpaired utterances | 4 source, 4 target | 25 utterances
VCC 2020 | crosslingual | nonparallel | 70 unpaired utterances | 4 source, 6 target | 25 utterances
TABLE I: Summary of VCC 2016, VCC 2018 and VCC 2020.
In the 2016 challenge, 17 participants submitted their conversion results. Two hundred native listeners of English joined the listening tests. It was reported that the best system, using GMM and waveform filtering, obtained an average of 3.0 on the five-point scale for the naturalness judgement, and about 70% of its converted speech samples were judged by listeners to be the same as the target speakers. However, it was also confirmed that there was still a huge gap between target natural speech and the converted speech. We observe that achieving both good quality and high speaker similarity remained an unsolved challenge at that time. More details of VCC 2016 can be found in [272]. Details of the best performing systems are reported in [273].
C. Overview of the 2018 Voice Conversion Challenge
Next, we give an overview of the 2018 voice conversion challenge [270] and its datasets². VCC 2018 offers two tasks, a parallel and a non-parallel voice conversion task. A dataset and its evaluation protocol are defined for each task. The dataset for the parallel conversion task is similar to that of the 2016 challenge, except that it has a smaller number of common utterances uttered by the source and target speakers. The target and source speakers are each four native speakers of American English (two females and two males), but they are different speakers from those used in the 2016 challenge. Like in the 2016 challenge, the participants were asked to develop conversion systems and to produce converted data for all possible source-target pair combinations.

VCC 2018 introduced a non-parallel voice conversion task for the first time. The same target speakers' data as in the parallel task are used as the target. However, the source speakers are four native speakers of American English (2 females and 2 males) different from those of the parallel conversion task, and their utterances are also all different from those of the target speakers. Like in the parallel voice conversion task, converted data for all possible source-target pair combinations needed to be produced by the participants. In total, twelve speakers are included in the VCC 2018 database. Each of the source and target speakers has a set of 81 sentences as training data, which is half of that for VCC 2016. The number of test sentences for evaluation is 35.
In the 2018 challenge, 23 participants submitted their conversion results to the parallel conversion task, with 11 of them additionally participating in the non-parallel conversion task. The same evaluation methodology as in the 2016 challenge was adopted, and 260 crowd-sourced native listeners of English joined the listening tests. It was reported that, in both tasks, the best system, using a phone encoder and a neural vocoder, obtained an average of 4.1 on the five-point scale for the naturalness judgement, and about 80% of its converted speech samples were judged by listeners to be the same as the target speakers. It was also reported that the best system has similar performance in both the parallel and non-parallel tasks, in contrast to results reported in the literature.

² The VCC2018 dataset is available at https://doi.org/10.7488/ds/2337
In VCC 2018, a spoofing countermeasure was introduced as a supplement to the subjective evaluation of voice quality, which brought together the voice conversion and speaker verification research communities. More details of the 2018 challenge can be found in [270]. Details of the best performing systems are reported in [274], [275].

From this challenge, we observed that new speech waveform generation paradigms, such as WaveNet and phone encoding, have brought significant progress to the voice conversion field. Further improvements have been achieved in follow-up papers [276], [277], and new VC systems that exceed the challenge's best performance have already been reported.
D. Overview of the 2020 Voice Conversion Challenge
The 2020 voice conversion challenge³ consists of two tasks: 1) non-parallel training in the same language (English); and 2) non-parallel training across different languages (English-Finnish, English-German, and English-Mandarin). In the first task, each participant trains voice conversion models for all source and target speaker pairs using up to 70 utterances per speaker as training data, including 20 parallel utterances and 50 non-parallel utterances in English. Overall, 16 voice conversion models (i.e., 4 sources by 4 targets) are to be developed. In the second task, each participant develops voice conversion models for all source and target speaker pairs using up to 70 utterances for each speaker (i.e., in English for the source speakers, and in Finnish, German, or Mandarin for the target speakers) as training data. Overall, 24 conversion systems (i.e., 4 sources by 6 targets) are to be developed.

In the 2020 challenge, the participants are allowed to mix and combine different source speakers' data to train speaker-independent models. Moreover, the participants can also use orthographic transcriptions of the released training data to develop their voice conversion systems. Last but not least, the participants are free to perform manual annotations of the released training data, which can effectively improve the quality of the converted speech.
³ The 2020 VCC whitepaper: http://www.vc-challenge.org/rules.html
The 2020 challenge organizers also built several baseline systems, including the top system of the previous challenge, on the new database. The code of the CycleVAE-based baseline⁴ and of the cascaded ASR + TTS based VC⁵ has been released, so that participants can build basic systems easily and focus on their own innovations. The 2020 challenge also features a multifaceted evaluation. In addition to the traditional evaluation metrics, the challenge also reports speech recognition, speaker recognition, and anti-spoofing evaluation results on the converted speech. The challenge is underway at the time we submit this manuscript.
E. Relevant Challenges: The ASVspoof Challenge
Spoofing of automatic speaker verification is a topic related to voice conversion, and it has also been organized into technology challenges. The ASVspoof series of challenges are such biennial events, which started in 2013. As in the voice conversion challenges, the organizers release a common database, including many pairs of spoofed audio (converted, generated, or replayed audio) and genuine audio, to the participants, who build anti-spoofing models using their own technology. The organizers rank the detection accuracy of the anti-spoofing results submitted by the participants.
In 2015, the first anti-spoofing database, including various types of spoofed audio generated by voice conversion and TTS systems, was constructed. This database became a reference standard in the automatic speaker verification (ASV) community [278], [279]. The main focus of the 2017 challenge was a replay task, for which a large quantity of real-world replayed speech data was collected [280]. In 2019, an even larger database including converted, generated, and replayed speech data was constructed [281]. The best performing systems in the 2016 and 2018 voice conversion challenges were also used for generating advanced spoofed audio [282]. The challenges revealed that some anti-spoofing systems outperform human listeners in detecting spoofed audio.
VIII. RESOURCES
In addition to the voice conversion challenge databases described above, the CMU-Arctic database [283] and the VCTK database [284] are also popular for voice conversion research. The current version of the CMU-Arctic database⁶ has 18 English speakers, each of whom reads out the same set of around 1,150 utterances, which are carefully selected from out-of-copyright texts from Project Gutenberg. This is suitable for parallel voice conversion since the sentences are common to all the speakers. The current version (ver. 0.92) of the CSTR VCTK corpus⁷ has speech data uttered by 110 English speakers with various dialects. Each speaker reads out about 400 sentences, which are selected from newspapers, the rainbow passage and an elicitation paragraph used for the speech accent archive.
⁴ https://github.com/bigpon/vcc20_baseline_cyclevae
⁵ https://github.com/espnet/espnet/tree/master/egs/vcc20
⁶ http://www.festvox.org/cmu_arctic/
⁷ https://doi.org/10.7488/ds/2645
Since the rainbow passage and an elicitation paragraph are
common to all the speakers, this database can be used for
both parallel and non-parallel voice conversion.
Since neural networks are data hungry and generalization to unseen speakers is key for successful conversion, large-scale but lower-quality databases such as LibriTTS and VoxCeleb are also used for training some of the components required for voice conversion (e.g., the speaker encoder). The LibriTTS corpus [285] has 585 hours of transcribed speech data uttered by a total of 2,456 speakers. The recording conditions and audio quality are less than ideal, but this corpus is suitable for training speaker encoder networks or generalizing any-to-any speaker mapping networks. The VoxCeleb database [286] is an even larger-scale speech database consisting of about 2,800 hours of untranscribed speech from over 6,000 speakers. It is an appropriate database for training noise-robust speaker encoder networks.
There are many open-source codes for training VC models. For instance, sprocket [287] supports GMM-based conversion, and ESPnet [288] supports a cascaded ASR and TTS system. In addition, there are many open-source implementations of neural-network based voice conversion written by the community on GitHub⁸.
IX. CONCLUSION
This article provides a comprehensive overview of voice conversion technology, covering the fundamentals and practice up to July 2020. We reveal the underlying technologies and their relationships, from statistical approaches to deep learning, and discuss their promise and limitations. We also study the evaluation techniques for voice conversion. Moreover, we report on the series of voice conversion challenges and on resources that are useful for researchers and engineers starting voice conversion research.
REFERENCES
[1] John Q. Stewart, “An electrical analogue of the vocal organs,” Nature, vol. 110, pp. 311–312, 1922.
[2] Alexander Kain and Michael W Macon, “Spectral voice conversion
for text-to-speech synthesis,” in Proceedings of the 1998 IEEE
International Conference on Acoustics, Speech and Signal Processing,
ICASSP’98 (Cat. No. 98CH36181). IEEE, 1998, vol. 1, pp. 285–288.
[3] Mingyang Zhang, Xin Wang, Fuming Fang, Haizhou Li, and Junichi
Yamagishi, “Joint training framework for text-to-speech and voice
conversion using multi-source tacotron and wavenet,” arXiv preprint
arXiv:1903.12389, 2019.
[4] Christophe Veaux, Junichi Yamagishi, and Simon King, “Towards personalised synthesised voices for individuals with vocal disabilities: Voice banking and reconstruction,” 08 2013.
[5] Brij Srivastava, Nathalie Vauquier, Md Sahidullah, Aurélien Bel-
let, Marc Tommasi, and Emmanuel Vincent, “Evaluating voice
conversion-based privacy protection against informed attackers,” 11
2019.
[6] Zhizheng Wu and Haizhou Li, “Voice conversion versus speaker verification: an overview,” APSIPA Transactions on Signal and Information Processing, vol. 3, pp. e17, 2014.
[7] Chien-yu Huang, Yist Y. Lin, Hung-yi Lee, and Lin-shan Lee, “Defending your voice: Adversarial attack on voice conversion,” ArXiv, vol. abs/2005.08781, 2020.
⁸ https://paperswithcode.com/task/voice-conversion
[8] Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao
Kuwabara, “Voice conversion through vector quantization,” Journal
of the Acoustical Society of Japan (E), vol. 11, no. 2, pp. 71–76, 1990.
[9] Kiyohiro Shikano, Satoshi Nakamura, and Masanobu Abe, “Speaker
Adaptation and Voice Conversion by Codebook Mapping,” IEEE
International Sympoisum on Circuits and Systems, pp. 594–597, 1991.
[10] Elina Helander, Jan Schwarz, Jani Nurminen, Hanna Silen, and
Moncef Gabbouj, “On the impact of alignment on voice conversion
performance,” in Ninth Annual Conference of the International
Speech Communication Association, 2008.
[11] Tomoki Toda, Alan W. Black, and Keiichi Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 8, pp. 2222–2235, 2007.
[12] Heiga Zen, Yoshihiko Nankaku, and Keiichi Tokuda, “Probabilistic
feature mapping based on trajectory hmms,” in Ninth Annual
Conference of the International Speech Communication Association,
2008.
[13] Kazuhiro Kobayashi, Shinnosuke Takamichi, Satoshi Nakamura, and
Tomoki Toda, “The NU-NAIST voice conversion system for the Voice
Conversion Challenge 2016,” in INTERSPEECH, 2016.
[14] Elina Helander, Tuomas Virtanen, Jani Nurminen, and Moncef
Gabbouj, “Voice conversion using partial least squares regression,”
IEEE Transactions on Audio, Speech, and Language Processing, vol.
18, no. 5, pp. 912–921, 2010.
[15] Elina Helander, Hanna Silén, Tuomas Virtanen, and Moncef Gabbouj, “Voice conversion using dynamic kernel partial least squares regression,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 3, pp. 806–817, 2011.
[16] Yi Luan, Daisuke Saito, Yosuke Kashiwagi, Nobuaki Minematsu, and
Keikichi Hirose, “Semi-supervised noise dictionary adaptation for
exemplar-based noise robust speech recognition,” in 2014 IEEE
international conference on acoustics, speech and signal processing
(ICASSP). IEEE, 2014, pp. 1745–1748.
[17] Ryoichi Takashima, Tetsuya Takiguchi, and Yasuo Ariki, “Exemplar-
based voice conversion in noisy environment,” In IEEE SLT, pp.
313–317, 2012.
[18] Zhizheng Wu, Tuomas Virtanen, Eng Siong Chng, and Haizhou Li,
“Exemplar-based sparse representation with residual compensation
for voice conversion,” IEEE/ACM Transactions on Audio, Speech and
Language Processing, vol. 22, no. 10, pp. 1506–1521, 2014.
[19] Ryo Aihara, Kenta Masaka, Tetsuya Takiguchi, and Yasuo Ariki,
“Parallel dictionary learning for multimodal voice conversion using
matrix factorization,” In INTERSPEECH, pp. 27–40, 2016.
[20] Zeyu Jin, Adam Finkelstein, Stephen DiVerdi, Jingwan Lu, and Gau-
tham J Mysore, “Cute: A concatenative method for voice conversion
using exemplar-based unit selection,” in 2016 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2016, pp. 5660–5664.
[21] Ryo Aihara, Toru Nakashika, Tetsuya Takiguchi, and Yasuo Ariki, “Voice conversion based on non-negative matrix factorization using phoneme-categorized dictionary,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 7894–7898.
[22] Berrak Sisman, Haizhou Li, and Kay Chen Tan, “Sparse represen-
tation of phonetic features for voice conversion with and without
parallel data,” in 2017 IEEE Automatic Speech Recognition and
Understanding Workshop (ASRU). IEEE, 2017, pp. 677–684.
[23] Mikiko Mashimo, Tomoki Toda, Hiromichi Kawanami, Kiyohiro
Shikano, and Nick Campbell, “Cross-language voice conversion
evaluation using bilingual databases,” IPSJ Journal, 2002.
[24] David Sundermann, Harald Hoge, Antonio Bonafonte, Hermann
Ney, Alan Black, and Shri Narayanan, “Text-independent voice
conversion based on unit selection,” in 2006 IEEE International
Conference on Acoustics Speech and Signal Processing Proceedings.
IEEE, 2006, vol. 1, pp. I–I.
[25] Hao Wang, Frank Soong, and Helen Meng, “A spectral space warping
approach to cross-lingual voice transformation in hmm-based tts,”
in 2015 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP). IEEE, 2015, pp. 4874–4878.
[26] David Sundermann, Hermann Ney, and H Hoge, “Vtln-based
crosslanguage voice conversion,” IEEE ASRU, 2003.
[27] D. Erro, A. Moreno, and A. Bonafonte, “Inca algorithm for training
voice conversion systems from nonparallel corpora,” IEEE Transac-
tions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp.
944–953, 2010.
[28] Daniel Erro and Asuncion Moreno, “Frame alignment method for
cross-lingual voice conversion,” INTERSPEECH, 1972.
[29] Jianhua Tao, Meng Zhang, Jani Nurminen, Jilei Tian, and Xia Wang,
“Supervisory data alignment for text-independent voice conversion,”
IEEE Transactions on Audio, Speech, and Language Processing, vol.
18, no. 5, pp. 932–943, 2010.
[30] Hanna Silen, Jani Nurminen, Elina Helander, and Moncef Gabbouj,
“Voice conversion for non-parallel datasets using dynamic kernel
partial least squares regression,” IEEE Transactions on Audio, Speech,
and Language Processing, vol. 20, no. 3, pp. 806–817, 2012.
[31] Peng Song, Yun Jin, Wenming Zheng, and Li Zhao, “Text-
independent voice conversion using speaker model alignment
method from non-parallel speech,” In Proceedings of the Annual
Conference of the International Speech Communication Association,
INTERSPEECH, , no. September, pp. 2308–2312, 2014.
[32] Lifa Sun, Kun Li, Hao Wang, Shiyin Kang, and Helen Meng, “Phonetic
posteriorgrams for many-to-one voice conversion without parallel
data training,” in 2016 IEEE International Conference on Multimedia
and Expo (ICME). IEEE, 2016, pp. 1–6.
[33] Timothy J Hazen, Wade Shen, and Christopher White, “Query-
by-example spoken term detection using phonetic posteriorgram
templates,” In IEEE ASRU, pp. 421–426, 2009.
[34] Keith Kintzley, Aren Jansen, and Hynek Hermansky, “Event selection
from phone posteriorgrams using matched filters,” In INTER-
SPEECH, pp. 1905–1908, 2011.
[35] Seyed Hamidreza Mohammadi and Alexander Kain, “An overview of voice conversion systems,” Speech Communication, vol. 88, pp. 65–82, 2017.
[36] M Narendranath, Hema A Murthy, S Rajendran, and B Yegna-
narayana, “Transformation of formants for voice conversion using
artificial neural networks,” Speech communication, vol. 16, no. 2,
pp. 207–216, 1995.
[37] Kurt Hornik, Maxwell Stinchcombe, and Halbert White, “Multi-
layer feedforward networks are universal approximators, Neural
networks, vol. 2, no. 5, pp. 359–366, 1989.
[38] Rabul Hussain Laskar, D Chakrabarty, Fazal Ahmed Talukdar,
K Sreenivasa Rao, and Kalyan Banerjee, “Comparing ann and gmm
in a voice conversion framework,” Applied Soft Computing, vol. 12,
no. 11, pp. 3332–3342, 2012.
[39] Hy Quy Nguyen, Siu Wa Lee, Xiaohai Tian, Minghui Dong, and Eng Siong Chng, “High quality voice conversion using prosodic and high-resolution spectral features,” Multimedia Tools and Applications, vol. 75, no. 9, pp. 5265–5285, 2016.
[40] Lifa Sun, Shiyin Kang, Kun Li, and Helen Meng, “Voice conversion
using deep bidirectional long short-term memory based recurrent
neural networks,” in 2015 IEEE international conference on acoustics,
speech and signal processing (ICASSP). IEEE, 2015, pp. 4869–4873.
[41] Jie Wu, Zhizheng Wu, and Lei Xie, “On the use of I-vectors and
average voice model for voice conversion without parallel data,”
In IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2016.
[42] Feng Long Xie, Frank K. Soong, and Haifeng Li, “A KL divergence
and DNN-based approach to voice conversion without parallel
training sentences,” In Proceedings of the Annual Conference of the
International Speech Communication Association, INTERSPEECH,
pp. 287–291, 2016.
[43] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-
Min Wang, “Voice conversion from non-parallel corpora using vari-
ational auto-encoder, in 2016 Asia-Pacific Signal and Information
Processing Association Annual Summit and Conference (APSIPA).
IEEE, 2016, pp. 1–6.
[44] Xiaohai Tian, Junchao Wang, Haihua Xu, Eng Siong Chng, and
Haizhou Li, “Average Modeling Approach to Voice Conversion
with Non-Parallel Data,” in Odyssey 2018 The Speaker and Language
Recognition Workshop, pp. 1–10, 2018.
[45] Lifa Sun, Hao Wang, Shiyin Kang, Kun Li, and Helen Meng, “Per-
sonalized, cross-lingual TTS using phonetic posteriorgrams,” In
INTERSPEECH, pp. 322–326, 2016.
[46] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-
Min Wang, “Voice Conversion from Unaligned Corpora using Varia-
tional Autoencoding Wasserstein Generative Adversarial Networks,”
arXiv:1704.00849 [cs.CL], 2017.
[47] Takuhiro Kaneko and Hirokazu Kameoka, “Parallel-data-free voice
conversion using cycle-consistent adversarial networks,” arXiv
preprint arXiv:1711.11293, 2017.
[48] Fuming Fang, Junichi Yamagishi, Isao Echizen, and Jaime Lorenzo-
Trueba, “High-quality nonparallel voice conversion based on cycle-
consistent adversarial network,” in 2018 IEEE International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP). IEEE,
2018, pp. 5279–5283.
[49] Jaime Lorenzo-Trueba, Fuming Fang, Xin Wang, Isao Echizen, Ju-
nichi Yamagishi, and Tomi Kinnunen, “Can we steal your vocal iden-
tity from the Internet?: Initial investigation of cloning Obama’s voice
using GAN, WaveNet and low-quality found data,” arXiv:1803.00860
[eess.AS], 2018.
[50] Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu
Hojo, “StarGAN-VC: Non-parallel many-to-many voice conversion
with star generative adversarial networks,” arXiv:1806.02169 [cs.SD],
2018.
[51] Manu Airaksinen, Lauri Juvela, Bajibabu Bollepalli, Junichi Yam-
agishi, and Paavo Alku, “A comparison between straight, glottal,
and sinusoidal vocoding in statistical parametric speech synthesis,”
IEEE/ACM Transactions on Audio, Speech, and Language Processing,
vol. 26, no. 9, pp. 1658–1670, 2018.
[52] Xin Wang, Jaime Lorenzo-Trueba, Shinji Takaki, Lauri Juvela, and
Junichi Yamagishi, A comparison of recent waveform generation
and acoustic modeling methods for neural-network-based speech
synthesis,” in Proceedings of the IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP), Calgary, Canada,
April 2018, pp. 4804–4808.
[53] Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan,
Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and
Koray Kavukcuoglu, “Wavenet: A generative model for raw audio,”
arXiv preprint arXiv:1609.03499, 2016.
[54] Akira Tamamori, Tomoki Hayashi, Kazuhiro Kobayashi, Kazuya
Takeda, and Tomoki Toda, “Speaker-dependent wavenet vocoder.,”
in Interspeech, 2017, vol. 2017, pp. 1118–1122.
[55] Tomoki Hayashi, Akira Tamamori, Kazuhiro Kobayashi, Kazuya
Takeda, and Tomoki Toda, “An investigation of multi-speaker
training for wavenet vocoder, in 2017 IEEE Automatic Speech
Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp.
712–718.
[56] Yi-Chiao Wu, Tomoki Hayashi, Patrick Lumban Tobing, Kazuhiro
Kobayashi, and Tomoki Toda, “Quasi-periodic wavenet vocoder: a
pitch dependent dilated convolution model for parametric speech
generation,” arXiv preprint arXiv:1907.00797, 2019.
[57] Yi-Chiao Wu, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro
Kobayashi, and Tomoki Toda, “Statistical voice conversion with
quasi-periodic wavenet vocoder, arXiv preprint arXiv:1907.08940,
2019.
[58] Berrak Sisman, Mingyang Zhang, Sakriani Sakti, Haizhou Li, and
Satoshi Nakamura, Adaptive wavenet vocoder for residual com-
pensation in gan-based voice conversion,” in 2018 IEEE Spoken
Language Technology Workshop (SLT). IEEE, 2018, pp. 282–289.
[59] H. Du, X. Tian, L. Xie, and H. Li, “Wavenet factorization with singular
value decomposition for voice conversion,” in 2019 IEEE Automatic
Speech Recognition and Understanding Workshop (ASRU), 2019, pp.
152–159.
[60] Wen-Chin Huang, Yi-Chiao Wu, Hsin-Te Hwang, Patrick Lumban
Tobing, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda, Yu Tsao,
and Hsin-Min Wang, “Refined wavenet vocoder for variational
autoencoder based voice conversion,” in 2019 27th European Signal
Processing Conference (EUSIPCO). IEEE, 2019, pp. 1–5.
[61] Berrak Sisman, Mingyang Zhang, and Haizhou Li, “A voice con-
version framework with tandem feature sparse representation and
speaker-adapted wavenet vocoder., in Interspeech, 2018, pp. 1978–
1982.
[62] Berrak Sisman, Mingyang Zhang, and Haizhou Li, “Group Sparse
Representation with WaveNet Vocoder Adaptation for Spectrum and
Prosody Conversion,” IEEE/ACM Transactions on Audio, Speech and
Language Processing, 2019.
[63] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman
Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den
Oord, Sander Dieleman, and Koray Kavukcuoglu, “Efficient neural
audio synthesis,” arXiv preprint arXiv:1802.08435, 2018.
[64] Ryan Prenger, Raffael Valle, and Bryan Catanzaro, “WaveGlow: A
Flow-based Generative Network for Speech Synthesis,” in Proceed-
ings of the IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), Brighton, UK, May 2019, pp. 3617–3621.
[65] Tomoki Toda, Ling-Hui Chen, Daisuke Saito, Fernando Villavicencio,
Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi, “The voice
conversion challenge 2016.,” in Interspeech, 2016, pp. 1632–1636.
[66] Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi, “Multidimen-
sional scaling of systems in the voice conversion challenge 2016.,”
in SSW, 2016, pp. 38–43.
[67] Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi, “Analysis of the
voice conversion challenge 2016 evaluation results,” in Interspeech,
2016, pp. 1637–1641.
[68] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke
Saito, Fernando Villavicencio, Tomi Kinnunen, and Zhenhua Ling,
“The voice conversion challenge 2018: Promoting development of
parallel and nonparallel methods,” arXiv preprint arXiv:1804.04262,
2018.
[69] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke
Saito, Fernando Villavicencio, Tomi Kinnunen, Zhenhua Ling, et al.,
“The voice conversion challenge 2018: database and results,” 2018.
[70] Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro
Kobayashi, and Tomoki Toda, “Nu voice conversion system for the
voice conversion challenge 2018.,” in Odyssey, 2018, pp. 219–226.
[71] Daniel Griffin and Jae Lim, “Signal estimation from modified short-
time fourier transform,” IEEE Transactions on Acoustics, Speech, and
Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.
[72] Eric Moulines and Francis Charpentier, “Pitch-synchronous wave-
form processing techniques for text-to-speech synthesis using di-
phones,” Speech communication, vol. 9, no. 5-6, pp. 453–467, 1990.
[73] Hélene Valbret, Eric Moulines, and Jean-Pierre Tubach, “Voice
transformation using psola technique,” Speech communication, vol.
11, no. 2-3, pp. 175–187, 1992.
[74] Levent M Arslan, “Speaker transformation algorithm using segmen-
tal codebooks (stasc),” Speech Communication, vol. 28, no. 3, pp.
211–226, 1999.
[75] Yannis Stylianou, Applying the harmonic plus noise model in
concatenative speech synthesis,” IEEE Transactions on speech and
audio processing, vol. 9, no. 1, pp. 21–29, 2001.
[76] Yannis Stylianou and Olivier Cappe, “A system for voice conversion
based on probabilistic classification and a harmonic plus noise
model,” in Proceedings of the 1998 IEEE International Conference
on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No.
98CH36181). IEEE, 1998, vol. 1, pp. 281–284.
[77] Daniel Erro and Asunción Moreno, “Weighted frequency warping for
voice conversion,” in Eighth Annual Conference of the International
Speech Communication Association, 2007.
[78] Satoshi Imai, Kazuo Sumita, and Chieko Furuichi, “Mel log spectrum
approximation (mlsa) filter for speech synthesis,”
Electronics and Communications in Japan (Part I: Communications),
vol. 66, no. 2, pp. 10–18, 1983.
[79] M. Airaksinen, L. Juvela, B. Bollepalli, J. Yamagishi, and P. Alku, A
comparison between straight, glottal, and sinusoidal vocoding in
statistical parametric speech synthesis,” IEEE/ACM Transactions on
Audio, Speech, and Language Processing, vol. 26, no. 9, pp. 1658–
1670, 2018.
[80] Hideki Kawahara, Ikuyo Masuda-Katsuse, and Alain De Cheveigne,
“Restructuring speech representations using a pitch-adaptive time–
frequency smoothing and an instantaneous-frequency-based f0 ex-
traction: Possible role of a repetitive structure in sounds, Speech
communication, vol. 27, no. 3-4, pp. 187–207, 1999.
[81] Srinivas Desai, E Veera Raghavendra, B Yegnanarayana, Alan W
Black, and Kishore Prahallad, “Voice conversion using artificial neu-
ral networks,” in 2009 IEEE International Conference on Acoustics,
Speech and Signal Processing. IEEE, 2009, pp. 3893–3896.
[82] Berrak Sisman and Haizhou Li, Wavelet analysis of speaker
dependent and independent prosody for voice conversion.,” in
Interspeech, 2018, pp. 52–56.
[83] Wei-Ning Hsu, Yu Zhang, and James Glass, “Unsupervised learning
of disentangled and interpretable representations from sequential
data,” in Advances in neural information processing systems, 2017,
pp. 1878–1889.
[84] Wei-Ning Hsu, Yu Zhang, and James Glass, “Learning latent repre-
sentations for speech generation and transformation,” arXiv preprint
arXiv:1704.04222, 2017.
[85] Sadaoki Furui, “Digital speech processing, synthesis, and recogni-
tion(revised and expanded),” Digital Speech Processing, Synthesis,
and Recognition, 2000.
[86] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep
Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang,
RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and
Yonghui Wu, “Natural tts synthesis by conditioning wavenet on mel
spectrogram predictions,” arXiv:1712.05884, 2018.
[87] Rui Liu, Berrak Sisman, Jingdong Li, Feilong Bao, Guanglai Gao, and
Haizhou Li, “Teacher-student training for robust tacotron-based tts,”
arXiv preprint arXiv:1911.02839, 2019.
[88] Zdeněk Hanzlíček, Jakub Vít, and Daniel Tihelka, “Wavenet-based
speech synthesis applied to czech,” in International Conference on
Text, Speech, and Dialogue. Springer, 2018, pp. 445–452.
[89] Sercan Ö Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos,
Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng,
Jonathan Raiman, et al., “Deep voice: Real-time neural text-to-
speech,” in Proceedings of the 34th International Conference on
Machine Learning-Volume 70. JMLR. org, 2017, pp. 195–204.
[90] Berrak Sisman, Machine Learning for Limited Data Voice Conversion,
Ph.D. thesis, 2019.
[91] Kuan Chen, Bo Chen, Jiahao Lai, and Kai Yu, “High-quality voice
conversion using spectrogram-based wavenet vocoder., in Inter-
speech, 2018, pp. 1993–1997.
[92] Nagaraj Adiga, Vassilis Tsiaras, and Yannis Stylianou, “On the use
of wavenet as a statistical vocoder, in 2018 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2018, pp. 5674–5678.
[93] Yi Zhao, Shinji Takaki, Hieu-Thi Luong, Junichi Yamagishi, Daisuke
Saito, and Nobuaki Minematsu, “Wasserstein gan and waveform
loss-based acoustic model training for multi-speaker text-to-speech
synthesis systems using a wavenet vocoder, IEEE Access, vol. 6, pp.
60478–60488, 2018.
[94] Xiaohai Tian, Eng Siong Chng, and Haizhou Li, A speaker-
dependent wavenet for voice conversion with non-parallel data,”
Proceedings of the Interspeech, Graz, Austria, pp. 15–19, 2019.
[95] Hui Lu, Zhiyong Wu, Runnan Li, Shiyin Kang, Jia Jia, and Helen
Meng, A compact framework for voice conversion using wavenet
conditioned on phonetic posteriorgrams,” in ICASSP 2019-2019 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2019, pp. 6810–6814.
[96] Hongqiang Du, Xiaohai Tian, Lei Xie, and Haizhou Li, “Wavenet fac-
torization with singular value decomposition for voice conversion,”
in 2019 IEEE Automatic Speech Recognition and Understanding
Workshop (ASRU). IEEE, 2019, pp. 152–159.
[97] Songxiang Liu, Yuewen Cao, Xixin Wu, Lifa Sun, Xunying Liu,
and Helen Meng, “Jointly trained conversion model and wavenet
vocoder for non-parallel voice conversion using mel-spectrograms
and phonetic posteriorgrams,” Proc. Interspeech 2019, pp. 714–718,
2019.
[98] Jan Chorowski, Ron Weiss, Samy Bengio, and Aaron Oord, “Unsuper-
vised speech representation learning using wavenet autoencoders,”
IEEE/ACM Transactions on Audio, Speech, and Language Processing,
vol. PP, pp. 1–1, 09 2019.
[99] Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas
Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, and
Vatsal Aggarwal, “Towards achieving robust universal neural vocod-
ing,” in Proc. Interspeech, 2019, vol. 2019, pp. 181–185.
[100] Prachi Govalkar, Johannes Fischer, Frank Zalkow, and Christian
Dittmar, A comparison of recent neural vocoders for speech signal
reconstruction,” in Proc. 10th ISCA Speech Synthesis Workshop, 2019,
pp. 7–12.
[101] Yuan-Hao Yi, Yang Ai, Zhen-Hua Ling, and Li-Rong Dai, “Singing
voice synthesis using deep autoregressive neural networks for acous-
tic modeling,” arXiv preprint arXiv:1906.08977, 2019.
[102] Takuma Okamoto, Tomoki Toda, Yoshinori Shiga, and Hisashi Kawai,
“Real-time neural text-to-speech with sequence-to-sequence acous-
tic model and waveglow or single gaussian wavernn vocoders,” in
Proc. Interspeech, 2019, vol. 2019, pp. 1308–1312.
[103] Soumi Maiti and Michael I Mandel, “Parametric resynthesis with
neural vocoders,” in 2019 IEEE Workshop on Applications of Signal
Processing to Audio and Acoustics (WASPAA). IEEE, 2019, pp. 303–
307.
[104] Xin Wang, Shinji Takaki, and Junichi Yamagishi, “Neural source-
filter-based waveform model for statistical parametric speech syn-
thesis,” in ICASSP 2019-2019 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp.
5916–5920.
[105] Xin Wang and Junichi Yamagishi, “Neural harmonic-plus-noise
waveform model with trainable maximum voice frequency for text-
to-speech synthesis,” arXiv preprint arXiv:1908.10256, 2019.
[106] Xin Wang, Shinji Takaki, and Junichi Yamagishi, “Neural source-
filter waveform models for statistical parametric speech synthesis,”
IEEE/ACM Transactions on Audio, Speech, and Language Processing,
vol. 28, pp. 402–415, 2019.
[107] Xiaohai Tian, Siu Wa Lee, Zhizheng Wu, Eng Siong Chng, and
Haizhou Li, “An Exemplar-based Approach to Fre-
quency Warping for Voice Conversion,” pp. 1–10, 2016.
[108] Hisao Kuwabara and Yoshinori Sagisak, “Acoustic characteristics of
speaker individuality: Control and conversion,” Speech communica-
tion, vol. 16, no. 2, pp. 165–173, 1995.
[109] Yannis Stylianou, Olivier Cappé, and Eric Moulines, “Continuous
probabilistic transform for voice conversion,” IEEE Transactions on
speech and audio processing, vol. 6, no. 2, pp. 131–142, 1998.
[110] Hiroshi Matsumoto and Yasuki Yamashita, “Unsupervised speaker
adaptation from short utterances based on a minimized fuzzy
objective function,” Journal of the Acoustical Society of Japan (E),
vol. 14, no. 5, pp. 353–361, 1993.
[111] Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano, “Voice con-
version algorithm based on gaussian mixture model with dynamic
frequency warping of straight spectrum,” in 2001 IEEE International
Conference on Acoustics, Speech, and Signal Processing. Proceedings
(Cat. No. 01CH37221). IEEE, 2001, vol. 2, pp. 841–844.
[112] Tomoki Toda, Jinlin Lu, Satoshi Nakamura, and Kiyohiro Shikano,
“Voice conversion algorithm based on gaussian mixture model
applied to straight,” 2000.
[113] Tomoki Toda, Alan W Black, and Keiichi Tokuda, “Spectral conver-
sion based on maximum likelihood estimation considering global
variance of converted parameter, in Proceedings.(ICASSP’05). IEEE
International Conference on Acoustics, Speech, and Signal Processing,
2005. IEEE, 2005, vol. 1, pp. I–9.
[114] Todd K Moon, “The expectation-maximization algorithm,” IEEE
Signal processing magazine, vol. 13, no. 6, pp. 47–60, 1996.
[115] Chuong B Do and Serafim Batzoglou, “What is the expectation
maximization algorithm?,” Nature biotechnology, vol. 26, no. 8, pp.
897–899, 2008.
[116] Guorong Xuan, Wei Zhang, and Peiqi Chai, “Em algorithms of gaus-
sian mixture model and hidden markov model, in Proceedings 2001
International Conference on Image Processing (Cat. No. 01CH37205).
IEEE, 2001, vol. 1, pp. 145–148.
[117] Maya R Gupta, Yihua Chen, et al., “Theory and use of the em
algorithm,” Foundations and Trends® in Signal Processing, vol. 4,
no. 3, pp. 223–296, 2011.
[118] Shinnosuke Takamichi, Tomoki Toda, Alan W Black, and Satoshi
Nakamura, “Modulation spectrum-based post-filter for gmm-based
voice conversion,” in Signal and Information Processing Association
Annual Summit and Conference (APSIPA), 2014 Asia-Pacific. IEEE,
2014, pp. 1–4.
[119] Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro
Shikano, “Maximum likelihood voice conversion based on gmm
with straight mixed excitation,” 2006.
[120] Hiromichi Kawanami, Yohei Iwami, Tomoki Toda, Hiroshi
Saruwatari, and Kiyohiro Shikano, “Gmm-based voice conversion
applied to emotional speech synthesis,” in Eighth European
Conference on Speech Communication and Technology, 2003.
[121] Ryo Aihara, Ryoichi Takashima, Tetsuya Takiguchi, and Yasuo
Ariki, “Gmm-based emotional voice conversion using spectrum and
prosody features,” American Journal of Signal Processing, vol. 2, no.
5, pp. 134–138, 2012.
[122] Hsin-Te Hwang, Yu Tsao, Hsin-Min Wang, Yih-Ru Wang, and Sin-
Horng Chen, “Incorporating global variance in the training phase
of gmm-based voice conversion,” in 2013 Asia-Pacific Signal and
Information Processing Association Annual Summit and Conference.
IEEE, 2013, pp. 1–6.
[123] Tudor-Cătălin Zorilă, Daniel Erro, and Inma Hernáez, “Improving
the quality of standard gmm-based voice conversion systems by
considering physically motivated linear transformations,” in Ad-
vances in Speech and Language Technologies for Iberian Languages,
pp. 30–39. Springer, 2012.
[124] Mostafa Ghorbandoost, Abolghasem Sayadiyan, Mohsen Ahangar,
Hamid Sheikhzadeh, Abdoreza Sabzi Shahrebabaki, and Jamal
Amini, “Voice conversion based on feature combination with limited
training data,” Speech Communication, vol. 67, pp. 113–128, 2015.
[125] Manuel Sam Ribeiro, Junichi Yamagishi, and Robert AJ Clark, “A
perceptual investigation of wavelet-based decomposition of f0 for
text-to-speech synthesis,” in Sixteenth Annual Conference of the
International Speech Communication Association, 2015.
[126] Manuel Sam Ribeiro, Oliver Watts, Junichi Yamagishi, and Robert AJ
Clark, “Wavelet-based decomposition of f0 as a secondary task for
dnn-based speech synthesis with multi-task learning,” in 2016 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2016, pp. 5525–5529.
[127] Cheng-Cheng Wang, Zhen-Hua Ling, Bu-Fan Zhang, and Li-Rong
Dai, “Multi-layer f0 modeling for hmm-based speech synthesis,”
in 2008 6th International Symposium on Chinese Spoken Language
Processing. IEEE, 2008, pp. 1–4.
[128] Gerard Sanchez, Hanna Silen, Jani Nurminen, and Moncef Gabbouj,
“Hierarchical modeling of F0 contours for voice conversion, In
Proceedings of the Annual Conference of the International Speech
Communication Association, INTERSPEECH, pp. 2318–2321, 2014.
[129] Daniel Erro, Asunción Moreno, and Antonio Bonafonte, “Voice con-
version based on weighted frequency warping,” IEEE Transactions
on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 922–
931, 2009.
[130] David Sundermann and Hermann Ney, “Vtln-based voice con-
version,” in Proceedings of the 3rd IEEE International Symposium
on Signal Processing and Information Technology (IEEE Cat. No.
03EX795). IEEE, 2003, pp. 556–559.
[131] Matthias Eichner, Matthias Wolff, and Rüdiger Hoffmann, “Voice
characteristics conversion for tts using reverse vtln,” in 2004 IEEE
International Conference on Acoustics, Speech, and Signal Processing.
IEEE, 2004, vol. 1, pp. I–17.
[132] Anna Přibilová and Jiří Přibil, “Non-linear frequency scale mapping
for voice conversion in text-to-speech system with cepstral descrip-
tion,” Speech Communication, vol. 48, no. 12, pp. 1691–1703, 2006.
[133] Robert Vích and Martin Vondra, “Pitch synchronous transform
warping in voice conversion,” in Cognitive Behavioural Systems, pp.
280–289. Springer, 2012.
[134] Elizabeth Godoy, Olivier Rosec, and Thierry Chonavel, Voice con-
version using dynamic frequency warping with amplitude scaling,
for parallel or nonparallel corpora,” IEEE Transactions on Audio,
Speech, and Language Processing, vol. 20, no. 4, pp. 1313–1323, 2011.
[135] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factor-
ization,” Advances in neural information processing systems,
pp. 556–562, 2001.
[136] Syu-Siang Wang, Alan Chern, Yu Tsao, Jeih-Weih Hung, Xugang Lu,
Ying-Hui Lai, and Borching Su, “Wavelet speech enhancement based
on nonnegative matrix factorization,” IEEE Signal Processing Letters,
vol. 23, 2016.
[137] Nasser Mohammadiha, Paris Smaragdis, and Arne Leijon, “Super-
vised and Unsupervised Speech Enhancement Using Nonnegative
Matrix Factorization,” IEEE Transactions on Audio, Speech and
Language Processing, vol. 21, no. 10, pp. 2140–2151, 2013.
[138] K A Akarsh, “Speech Enhancement using Non negative Matrix
Factorization and Enhanced NMF,” International Conference on
Circuit, Power and Computing Technologies (ICCPCT), 2015.
[139] Kevin W Wilson, Bhiksha Raj, Paris Smaragdis, and Ajay Divakaran,
“Speech denoising using nonnegative matrix factorization with pri-
ors,” In IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), 2008.
[140] Meng Sun, Yinan Li, Jort F Gemmeke, and Xiongwei Zhang, “Speech
enhancement under low SNR conditions via noise estimation us-
ing sparse and low-rank NMF with Kullback-Leibler divergence,”
IEEE/ACM Transactions on Audio, Speech and Language Processing,
vol. 23, no. 7, pp. 1233–1242, 2015.
[141] Zhizheng Wu, Tuomas Virtanen, Tomi Kinnunen, Eng Siong Chng,
and Haizhou Li, “Examplar-Based Voice Conversion Using Non-
Negative Spectrogram Deconvolution, 8th ISCA Speech Synthesis
Workshop, 2013.
[142] Yi Chiao Wu, Hsin Te Hwang, Chin Cheng Hsu, Yu Tsao, and
Hsin Min Wang, “Locally linear embedding for exemplar-based
spectral conversion,” In Proceedings of the Annual Conference of the
International Speech Communication Association, INTERSPEECH,
pp. 1652–1656, 2016.
[143] Huaiping Ming, Dongyan Huang, Lei Xie, Shaofei Zhang, Minghui
Dong, and Haizhou Li, “Exemplar-based sparse representation of
timbre and prosody for voice conversion,” In IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP),
2016.
[144] Berrak Sisman, Haizhou Li, and Kay Chen Tan, “Transformation
of prosody in voice conversion,” in 2017 Asia-Pacific Signal and
Information Processing Association Annual Summit and Conference
(APSIPA ASC). IEEE, 2017, pp. 1537–1546.
[145] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-
Min Wang, “Dictionary update for nmf-based voice conversion using
an encoder-decoder network,” 10th International Symposium on
Chinese Spoken Language Processing (ISCSLP), vol. 22, no. 3, pp.
293–297, 2016.
[146] Hermann Ney, David Suendermann, Antonio Bonafonte, and Harald
Höge, “A first step towards text-independent voice conversion,”
in Eighth International Conference on Spoken Language Processing,
2004.
[147] Hui Ye and Steve J. Young, “Voice conversion for unknown speakers,”
in INTERSPEECH 2004 - ICSLP, 8th International Conference on
Spoken Language Processing, Jeju Island, Korea, October 4-8, 2004.
2004, ISCA.
[148] Hui Ye and Steve Young, “Quality-enhanced voice morphing using
maximum likelihood transformations,” IEEE Transactions on Audio,
Speech, and Language Processing, vol. 14, pp. 1301–1312, 2006.
[149] Alan W Black and Nick Campbell, “Optimising selection of units
from speech databases for concatenative synthesis.,” 1995.
[150] Kei Fujii, Jun Okawa, and Kaori Suigetsu, “High individuality voice
conversion based on concatenative speech synthesis,” International
Journal of Electrical, Computer, Energetic, Electronic and Communi-
cation Engineering, vol. 1, no. 11, pp. 1617–1622, 2007.
[151] Yoshinori Sagisaka, Nobuyoshi Kaiki, Naoto Iwahashi, and Katsuhiko
Mimura, “Atr µ-talk speech synthesis system, in Second Interna-
tional Conference on Spoken Language Processing, 1992.
[152] Daniel Erro, Ferran Diego, and Antonio Bonafonte, “Voice conver-
sion of non-aligned data using unit selection,” 2006.
[153] A. Mouchtaris, J. Van der Spiegel, and P. Mueller, “Nonparallel
training for voice conversion based on a parameter adaptation
approach,” IEEE Transactions on Audio, Speech, and Language
Processing, vol. 14, no. 3, pp. 952–963, 2006.
[154] Tomoki Toda, Yamato Ohtani, and Kiyohiro Shikano, “Eigenvoice
conversion based on gaussian mixture model,” in INTERSPEECH,
2006.
[155] Najim Dehak, Patrick J Kenny, Réda Dehak, Pierre Dumouchel, and
Pierre Ouellet, “Front-end factor analysis for speaker verification,
IEEE Transactions on Audio, Speech, and Language Processing, vol.
19, no. 4, pp. 788–798, 2010.
[156] Z. Wu, T. Kinnunen, E. S. Chng, and H. Li, “Mixture of factor ana-
lyzers using priors from non-parallel speech for voice conversion,”
IEEE Signal Processing Letters, vol. 19, no. 12, pp. 914–917, 2012.
[157] Yannis Stylianou, Olivier Cappé, and Eric Moulines, “Continuous
probabilistic transform for voice conversion,” IEEE Transactions on
Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, 1998.
[158] Yi Zhou, Xiaohai Tian, Haihua Xu, Rohan Kumar Das, and Haizhou
Li, “Cross-lingual voice conversion with bilingual phonetic pos-
teriorgrams and average modeling,” International Conference on
Acoustic, Speech and Signal Processing (ICASSP), 2019.
[159] Tomi Kinnunen, Lauri Juvela, Paavo Alku, and Junichi Yamagishi,
“Non-parallel voice conversion using i-vector plda: Towards unifying
speaker verification and transformation,” in 2017 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2017, pp. 5535–5539.
[160] Hiroyuki Miyoshi, Yuki Saito, Shinnosuke Takamichi, and Hiroshi
Saruwatari, “Voice conversion using sequence-to-sequence learning
of context posterior probabilities,” arXiv preprint arXiv:1704.02360,
2017.
[161] Seung-won Park, Doo-young Kim, and Myun-chul Joe, “Cotatron:
Transcription-guided speech encoder for any-to-many voice conver-
sion without parallel data,” ArXiv, vol. abs/2005.03295, 2020.
[162] Feng-Long Xie, Yao Qian, Frank K Soong, and Haifeng Li, “Pitch
transformation in neural network based voice conversion,” in
The 9th International Symposium on Chinese Spoken Language
Processing. IEEE, 2014, pp. 197–200.
[163] Toru Nakashika, Ryoichi Takashima, Tetsuya Takiguchi, and Yasuo
Ariki, “Voice conversion in high-order eigen space using deep belief
nets.,” in Interspeech, 2013, pp. 369–372.
[164] Seyed Hamidreza Mohammadi and Alexander Kain, Voice con-
version using deep neural networks with speaker-independent pre-
training,” in 2014 IEEE Spoken Language Technology Workshop (SLT).
IEEE, 2014, pp. 19–23.
[165] Feng-Long Xie, Yao Qian, Yuchen Fan, Frank K Soong, and Haifeng
Li, “Sequence error (se) minimization training of neural network
for voice conversion,” in Fifteenth Annual Conference of the Inter-
national Speech Communication Association, 2014.
[166] Keiichi Tokuda, Takayoshi Yoshimura, Takashi Masuko, Takao
Kobayashi, and Tadashi Kitamura, “Speech parameter generation
algorithms for hmm-based speech synthesis,” in 2000 IEEE In-
ternational Conference on Acoustics, Speech, and Signal Processing.
Proceedings (Cat. No. 00CH37100). IEEE, 2000, vol. 3, pp. 1315–1318.
[167] Ling-hui Chen, Zhen-hua Ling, Li-juan Liu, and Li-rong Dai, “Voice
Conversion Using Deep Neural Networks With Layer-Wise Genera-
tive Training, IEEE Transactions on Audio, Speech and Language
Processing, vol. 22, no. 12, pp. 1859–1872, 2014.
[168] Toru Nakashika, Tetsuya Takiguchi, and Yasuo Ariki, “High-order
sequence modeling using speaker-dependent recurrent temporal
restricted Boltzmann machines for voice conversion,” In Proceedings
of the Annual Conference of the International Speech Communication
Association, INTERSPEECH, pp. 2278–2282, 2014.
[169] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term mem-
ory, Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[170] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins, “Learning
to forget: Continual prediction with lstm,” 1999.
[171] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink,
and Jürgen Schmidhuber, “Lstm: A search space odyssey,” IEEE
transactions on neural networks and learning systems, vol. 28, no.
10, pp. 2222–2232, 2016.
[172] Huaiping Ming, Dongyan Huang, Lei Xie, Jie Wu, Minghui Dong,
and Haizhou Li, “Deep bidirectional LSTM modeling of timbre
and prosody for emotional voice conversion,” In Proceedings of
the Annual Conference of the International Speech Communication
Association, INTERSPEECH, pp. 2453–2457, 2016.
[173] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural
machine translation by jointly learning to align and translate, arXiv
preprint arXiv:1409.0473, 2014.
[174] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Atten-
tion is all you need,” in Advances in neural information processing
systems, 2017, pp. 5998–6008.
[175] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and
spell: A neural network for large vocabulary conversational speech
recognition,” in 2016 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), 2016, pp. 4960–4964.
[176] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J
Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen,
Samy Bengio, et al., Tacotron: Towards end-to-end speech syn-
thesis,” arXiv preprint arXiv:1703.10135, 2017.
[177] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O Arik, Ajay
Kannan, Sharan Narang, Jonathan Raiman, and John Miller, “Deep
voice 3: 2000-speaker neural text-to-speech,” arXiv preprint
arXiv:1710.07654, 2017.
[178] Hideyuki Tachibana, Katsuya Uenoyama, and Shunsuke Aihara,
“Efficiently trainable text-to-speech system based on deep convolu-
tional networks with guided attention,” in 2018 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2018, pp. 4784–4788.
[179] Jing-Xuan Zhang, Zhen-Hua Ling, Li-Juan Liu, Yuan Jiang, and
Li-Rong Dai, “Sequence-to-sequence acoustic modeling for voice
conversion,” IEEE/ACM Transactions on Audio, Speech, and Language
Processing, vol. 27, no. 3, pp. 631–644, 2019.
[180] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry
Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio,
“Learning phrase representations using rnn encoder-decoder for
statistical machine translation,” arXiv preprint arXiv:1406.1078,
2014.
[181] K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, “Atts2s-vc:
Sequence-to-sequence voice conversion with attention and context
preservation mechanisms,” in ICASSP 2019 - 2019 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP),
2019, pp. 6805–6809.
[182] Minh-Thang Luong, Hieu Pham, and Christopher D Manning, “Ef-
fective approaches to attention-based neural machine translation,”
arXiv preprint arXiv:1508.04025, 2015.
[183] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann
Dauphin, “Convolutional sequence to sequence learning,” ArXiv,
vol. abs/1705.03122, 2017.
[184] Hirokazu Kameoka, Kou Tanaka, Takuhiro Kaneko, and Nobukatsu
Hojo, “Convs2s-vc: Fully convolutional sequence-to-sequence voice
conversion,” ArXiv, vol. abs/1811.01609, 2018.
[185] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros, “Un-
paired image-to-image translation using cycle-consistent adversarial
networks,” in Proceedings of the IEEE international conference on
computer vision, 2017, pp. 2223–2232.
[186] Kenan E Ak, Joo Hwee Lim, Jo Yew Tham, and Ashraf A Kassim,
Attribute manipulation generative adversarial networks for fashion
images,” in Proceedings of the IEEE International Conference on
Computer Vision, 2019, pp. 10541–10550.
[187] Kenan E Ak, Ashraf A Kassim, Joo Hwee Lim, and Jo Yew Tham,
“Learning attribute representations with localization for flexible
fashion search,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2018, pp. 7708–7717.
[188] Kenan Emir Ak, Deep learning approaches for attribute manipulation
and text-to-image synthesis, Ph.D. thesis, 2019.
[189] Kenan E Ak, Joo Hwee Lim, Jo Yew Tham, and Ashraf A Kassim,
“Efficient multi-attribute similarity learning towards attribute-based
fashion search,” in 2018 IEEE Winter Conference on Applications of
Computer Vision (WACV). IEEE, 2018, pp. 1671–1679.
[190] Kenan E Ak, Ning Xu, Zhe Lin, and Yilin Wang, “Incorporating
reinforced adversarial learning in autoregressive image generation,
arXiv preprint arXiv:2007.09923, 2020.
[191] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio,
“Generative adversarial nets,” in Advances in neural information
processing systems, 2014, pp. 2672–2680.
[192] Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz, “Mul-
timodal unsupervised image-to-image translation,” in Proceedings
of the European Conference on Computer Vision (ECCV), 2018, pp.
172–189.
[193] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A
Efros, Oliver Wang, and Eli Shechtman, “Toward multimodal image-
to-image translation,” in Advances in neural information processing
systems, 2017, pp. 465–476.
[194] Kenan E Ak, Joo Hwee Lim, Jo Yew Tham, and Ashraf A Kas-
sim, “Semantically consistent text to fashion image synthesis with
an enhanced attentional generative adversarial network,” Pattern
Recognition Letters, 2020.
[195] Kenan Emir Ak, Joo Hwee Lim, Jo Yew Tham, and Ashraf Kassim,
“Semantically consistent hierarchical text to fashion image synthesis
with an enhanced-attentional generative adversarial network,” in
Proceedings of the IEEE International Conference on Computer Vision
Workshops, 2019, pp. 0–0.
[196] Zhong Meng, Jinyu Li, Yifan Gong, and Biing-Hwang (Fred) Juang,
“Cycle-Consistent Speech Enhancement, INTERSPEECH, 2018.
[197] Masato Mimura, Shinsuke Sakai, and Tatsuya Kawahara, “Cross-
domain speech recognition using nonparallel corpora with cycle-
consistent adversarial networks,” IEEE Automatic Speech Recognition
and Understanding Workshop (ASRU), 2017.
[198] Dongsuk Yook, In-Chul Yoo, and Seungho Yoo, “Voice conversion
using conditional cyclegan,” in 2018 International Conference on
Computational Science and Computational Intelligence (CSCI). IEEE,
2018, pp. 1460–1461.
[199] Sicong Huang, Qiyang Li, Cem Anil, Xuchan Bao, Sageev Oore,
and Roger B Grosse, “Timbretron: A wavenet (cyclegan (cqt
(audio))) pipeline for musical timbre transfer, arXiv preprint
arXiv:1811.09620, 2018.
[200] Takuhiro Kaneko and Hirokazu Kameoka, “Cyclegan-vc: Non-parallel
voice conversion using cycle-consistent adversarial networks,” in
2018 26th European Signal Processing Conference (EUSIPCO). IEEE,
2018, pp. 2100–2104.
[201] Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, and Nobukatsu
Hojo, “Cyclegan-vc2: Improved cyclegan-based non-parallel voice
conversion,” in ICASSP 2019-2019 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp.
6820–6824.
[202] Yaniv Taigman, Adam Polyak, and Lior Wolf, “Unsupervised cross-
domain image generation,” ArXiv abs/1611.02200, 2016.
[203] Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro
Kobayashi, and Tomoki Toda, Voice conversion with cyclic recur-
rent neural network and fine-tuned wavenet vocoder,” in ICASSP
2019-2019 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP). IEEE, 2019, pp. 6815–6819.
[204] Berrak Sisman, Mingyang Zhang, Minghui Dong, and Haizhou Li,
“On the study of generative adversarial networks for cross-lingual
voice conversion,” in 2019 IEEE Automatic Speech Recognition and
Understanding Workshop (ASRU). IEEE, 2019, pp. 144–151.
[205] Kun Zhou, Berrak Sisman, and Haizhou Li, “Transforming spec-
trum and prosody for emotional voice conversion with non-parallel
training data,” arXiv preprint arXiv:2002.00198, 2020.
[206] Kun Zhou, Berrak Sisman, Mingyang Zhang, and Haizhou Li, “Con-
verting anyone’s emotion: Towards speaker-independent emotional
voice conversion,” arXiv preprint arXiv:2005.07025, 2020.
[207] Cheng-chieh Yeh, Po-chun Hsu, Ju-chieh Chou, Hung-yi Lee, and
Lin-shan Lee, “Rhythm-flexible voice conversion without parallel
data using cycle-gan over phoneme posteriorgram sequences, in
2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018,
pp. 274–281.
[208] Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, and Haizhou
Li, “Wavetts: Tacotron-based tts with joint time-frequency domain
loss,” arXiv preprint arXiv:2002.00417, 2020.
[209] Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu
Kameoka, and Tomoki Toda, “Voice transformer network: Sequence-
to-sequence voice conversion using transformer with text-to-speech
pretraining,” arXiv preprint arXiv:1912.06813, 2019.
[210] Jing-Xuan Zhang, Zhen-Hua Ling, Yuan Jiang, Li-Juan Liu, Chen
Liang, and Li-Rong Dai, “Improving sequence-to-sequence voice
conversion by adding text-supervision,” in ICASSP 2019-2019 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2019, pp. 6785–6789.
[211] Hieu-Thi Luong and Junichi Yamagishi, “Bootstrapping non-parallel
voice conversion from speaker-adaptive text-to-speech,” in 2019
IEEE Automatic Speech Recognition and Understanding Workshop
(ASRU). IEEE, 2019, pp. 200–207.
[212] Hieu-Thi Luong and Junichi Yamagishi, “Nautilus: a versatile voice
cloning system,” arXiv preprint arXiv:2005.11004, 2020.
[213] Fadi Biadsy, Ron J Weiss, Pedro J Moreno, Dimitri Kanvesky, and
Ye Jia, “Parrotron: An end-to-end speech-to-speech conversion
model and its applications to hearing-impaired speech and speech
separation,” arXiv preprint arXiv:1904.04169, 2019.
[214] Songxiang Liu, Yuewen Cao, and Helen Meng, “Multi-target emo-
tional voice conversion with neural vocoders,” arXiv preprint
arXiv:2004.03782, 2020.
[215] Mingyang Zhang, Berrak Sisman, Sai Sirisha Rallabandi, Haizhou
Li, and Li Zhao, “Error reduction network for dblstm-based voice
conversion,” in 2018 Asia-Pacific Signal and Information Processing
Association Annual Summit and Conference (APSIPA ASC). IEEE,
2018, pp. 823–828.
[216] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo
Larochelle, and Ole Winther, “Autoencoding beyond pixels using
a learned similarity metric,” Proceedings of The 33rd International
Conference on Machine Learning, PMLR, 2016.
[217] Ju-Chieh Chou, Cheng chieh Yeh, and Hung yi Lee, “One-shot voice
conversion by separating speaker and content representations with
instance normalization,” ArXiv, vol. abs/1904.05742, 2019.
[218] Da-Yi Wu and Hung-yi Lee, “One-shot voice conversion by vector
quantization,” in ICASSP 2020-2020 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp.
7734–7738.
[219] Da-Yi Wu, Yen-Hao Chen, and Hung-Yi Lee, “Vqvc+: One-shot voice
conversion by vector quantization and u-net architecture, arXiv
preprint arXiv:2006.04154, 2020.
[220] Diederik P Kingma and Max Welling, “Auto-encoding variational
bayes,” arXiv preprint arXiv:1312.6114, 2013.
[221] Shaojin Ding and Ricardo Gutierrez-Osuna, “Group latent embed-
ding for vector quantized variational autoencoder in non-parallel
voice conversion.,” in INTERSPEECH, 2019, pp. 724–728.
[222] Wen-Chin Huang, Hsin-Te Hwang, Yu-Huai Peng, Yu Tsao, and Hsin-
Min Wang, “Voice conversion based on cross-domain features using
variational auto encoders,” in 2018 11th International Symposium
on Chinese Spoken Language Processing (ISCSLP). IEEE, 2018, pp.
51–55.
[223] Yanping Li, Kong Aik Lee, Yougen Yuan, Haizhou Li, and Zhen Yang,
“Many-to-many voice conversion based on bottleneck features with
variational autoencoder for non-parallel training data,” in 2018
Asia-Pacific Signal and Information Processing Association Annual
Summit and Conference (APSIPA ASC). IEEE, 2018, pp. 829–833.
[224] Yuki Saito, Yusuke Ijima, Kyosuke Nishida, and Shinnosuke
Takamichi, “Non-parallel voice conversion using variational autoen-
coders conditioned by phonetic posteriorgrams and d-vectors,” in
2018 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2018, pp. 5274–5278.
[225] Wen-Chin Huang, Hao Luo, Hsin-Te Hwang, Chen-Chou Lo, Yu-
Huai Peng, Yu Tsao, and Hsin-Min Wang, “Unsupervised represen-
tation disentanglement using cross domain features and adversarial
learning in variational autoencoder based voice conversion,” IEEE
Transactions on Emerging Topics in Computational Intelligence, p.
1–12, 2020.
[226] Songxiang Liu, Yuewen Cao, Shiyin Kang, Na Hu, Xunying Liu, Dan
Su, Dong Yu, and Helen Meng, “Transferring source style in non-
parallel voice conversion,” arXiv preprint arXiv:2005.09178, 2020.
[227] R. Kubichek, “Mel-cepstral distance measure for objective speech
quality assessment,” Communications, Computers and Signal Pro-
cessing, pp. 125–128, 1993.
[228] Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen,
“Pearson correlation coefficient, in Noise reduction in speech
processing, pp. 1–4. Springer, 2009.
[229] Tianfeng Chai and Roland R Draxler, “Root mean square error (rmse)
or mean absolute error (mae)?–arguments against avoiding rmse in
the literature,” Geoscientific model development, vol. 7, no. 3, pp.
1247–1250, 2014.
[230] Cort J Willmott and Kenji Matsuura, Advantages of the mean
absolute error (mae) over the root mean square error (rmse) in
assessing average model performance,” Climate research, vol. 30,
no. 1, pp. 79–82, 2005.
[231] Volodya Grancharov and W Bastiaan Kleijn, “Speech quality as-
sessment,” in Springer handbook of speech processing, pp. 83–100.
Springer, 2008.
[232] Robert C Streijl, Stefan Winkler, and David S Hands, “Mean opinion
score (mos) revisited: methods and applications, limitations and
alternatives,” Multimedia Systems, vol. 22, no. 2, pp. 213–227, 2016.
[233] Min Chu, Hu Peng, and Yong Zhao, “Optimization of an objective
measure for estimating mean opinion score of synthesized speech,”
June 10 2008, US Patent 7,386,451.
[234] Mahesh Viswanathan and Madhubalan Viswanathan, “Measuring
speech quality for text-to-speech systems: development and assess-
ment of a modified mean opinion score (mos) scale,” Computer
Speech & Language, vol. 19, no. 1, pp. 55–83, 2005.
[235] Alexander Kain and Michael W Macon, “Design and evaluation of
a voice conversion algorithm based on spectral envelope mapping
and residual prediction,” in 2001 IEEE International Conference
on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.
01CH37221). IEEE, 2001, vol. 2, pp. 813–816.
[236] Terry N. Flynn and Anthony A. J. Marley, “Best worst scaling:
Theory and methods,” Handbook of choice modelling, Edward Elgar
Publishing, pp. 178–201, 2014.
[237] Tomoki Toda, Ling-Hui Chen, Daisuke Saito, Fernando Villavicencio,
Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi, “The Voice
Conversion Challenge 2016,” In INTERSPEECH, pp. 1632–1636, 2016.
[238] Mingyang Zhang, Berrak Sisman, Li Zhao, and Haizhou Li, “Deep-
conversion: Voice conversion with limited parallel training data,”
Speech Communication, 2020.
[239] Jiahao Lai, Bo Chen, Tian Tan, Sibo Tong, and Kai Yu, “Phone-aware
lstm-rnn for voice conversion,” in 2016 IEEE 13th International
Conference on Signal Processing (ICSP). IEEE, 2016, pp. 177–182.
[240] Alan W Black, H Timothy Bunnell, Ying Dou, Prasanna Kumar
Muthukumar, Florian Metze, Daniel Perry, Tim Polzehl, Kishore
Prahallad, Stefan Steidl, and Callie Vaughn, “Articulatory features for
expressive speech synthesis,” in 2012 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp.
4005–4008.
[241] Beth Logan et al., “Mel frequency cepstral coefficients for music
modeling.,” in Ismir, 2000, vol. 270, pp. 1–11.
[242] Chitralekha Gupta, Haizhou Li, and Ye Wang, “Perceptual evaluation
of singing quality, in 2017 Asia-Pacific Signal and Information
Processing Association Annual Summit and Conference (APSIPA ASC),
2017, pp. 577–586.
[243] Wei Chu and Abeer Alwan, “Reducing f0 frame error of f0 tracking
algorithms under noisy conditions with an unvoiced/voiced classifi-
cation frontend,” in 2009 IEEE International Conference on Acoustics,
Speech and Signal Processing. IEEE, 2009, pp. 3969–3972.
[244] Tomohiro Nakatani, Shigeaki Amano, Toshio Irino, Kentaro Ishizuka,
and Tadahisa Kondo, “A method for fundamental frequency estima-
tion and voicing decision: Application to infant utterances recorded
in real acoustical environments,” Speech Communication, vol. 50,
no. 3, pp. 203–214, 2008.
[245] RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy
Stanton, Joel Shor, Ron J Weiss, Rob Clark, and Rif A Saurous, To-
wards end-to-end prosody transfer for expressive speech synthesis
with tacotron,” arXiv preprint arXiv:1803.09047, 2018.
[246] Berrak Sisman, Grandee Lee, Haizhou Li, and Kay Chen Tan, “On
the analysis and evaluation of prosody conversion techniques,” in
2017 International Conference on Asian Language Processing (IALP).
IEEE, 2017, pp. 44–47.
[247] Tomomi Watanabe, Takahiro Murakami, Munehiro Namba, Tetsuya
Hoya, and Yoshihisa Ishida, “Transformation of spectral envelope
for voice conversion based on radial basis function networks,” in
Seventh international conference on spoken language processing,
2002.
[248] Kazuhiro Kobayashi, Shinnosuke Takamichi, Satoshi Nakamura, and
Tomoki Toda, “The nu-naist voice conversion system for the voice
conversion challenge 2016.,” in Interspeech, 2016, pp. 1667–1671.
[249] B Ramani, MP Actlin Jeeva, P Vijayalakshmi, and T Nagarajan,
“Cross-lingual voice conversion-based polyglot speech synthesizer
for indian languages,” in Fifteenth annual conference of the inter-
national speech communication association, 2014.
[250] Oytun Turk and Levent M Arslan, “Robust processing techniques
for voice conversion,” Computer Speech & Language, vol. 20, no. 4,
pp. 441–467, 2006.
[251] Srinivas Desai, Alan W Black, B Yegnanarayana, and Kishore Prahal-
lad, “Spectral mapping using artificial neural networks for voice
conversion,” IEEE Transactions on Audio, Speech, and Language
Processing, vol. 18, no. 5, pp. 954–964, 2010.
[252] Masatsune Tamura, Takashi Masuko, Keiichi Tokuda, and Takao
Kobayashi, “Speaker adaptation for hmm-based speech synthesis
system using mllr, in the third ESCA/COCOSDA Workshop (ETRW)
on Speech Synthesis, 1998.
[253] Volodya Grancharov, David Yuheng Zhao, Jonas Lindblom, and
W Bastiaan Kleijn, “Low-complexity, nonintrusive speech quality
assessment,” IEEE Transactions on Audio, Speech, and Language
Processing, vol. 14, no. 6, pp. 1948–1956, 2006.
[254] Mirjam Wester, Cassia Valentini-Botinhao, and Gustav Eje Henter,
Are we using enough listeners? no!—an empirically-supported cri-
tique of interspeech 2014 tts evaluations,” in Sixteenth Annual
Conference of the International Speech Communication Association,
2015.
[255] Slawomir Zielinski, Philip Hardisty, Christopher Hummersone, and
Francis Rumsey, “Potential biases in mushra listening tests,” in Au-
dio Engineering Society Convention 123. Audio Engineering Society,
2007.
[256] Hadas Benisty and David Malah, Voice conversion using gmm with
enhanced global variance,” in Twelfth Annual Conference of the
International Speech Communication Association, 2011.
[257] Jakub Vít, Zdeněk Hanzlíček, and Jindřich Matoušek, “On the
analysis of training data for wavenet-based speech synthesis,” in
2018 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2018, pp. 5684–5688.
[258] Meng Zhang, Jianhua Tao, Jilei Tian, and Xia Wang, “Text-
independent voice conversion based on state mapped codebook,” in
2008 IEEE International Conference on Acoustics, Speech and Signal
Processing. IEEE, 2008, pp. 4605–4608.
[259] ITU-R Recommendation BS.1534-1, “Method for the subjective as-
sessment of intermediate sound quality (MUSHRA),” International
Telecommunications Union, Geneva, Switzerland, 2001.
[260] Antony W Rix, John G Beerends, Michael P Hollier, and Andries P
Hekstra, “Perceptual evaluation of speech quality (pesq)-a new
method for speech quality assessment of telephone networks and
codecs,” in 2001 IEEE International Conference on Acoustics, Speech,
and Signal Processing. Proceedings (Cat. No. 01CH37221). IEEE, 2001,
vol. 2, pp. 749–752.
[261] Szu-Wei Fu, Yu Tsao, Hsin-Te Hwang, and Hsin-Min Wang, “Quality-
net: An end-to-end non-intrusive speech quality assessment model
based on blstm,” arXiv preprint arXiv:1808.05344, 2018.
[262] Takenori Yoshimura, Gustav Eje Henter, Oliver Watts, Mirjam Wester,
Junichi Yamagishi, and Keiichi Tokuda, “A hierarchical predictor of
synthetic speech naturalness using neural networks.,” in INTER-
SPEECH, 2016, pp. 342–346.
[263] Brian Patton, Yannis Agiomyrgiannakis, Michael Terry, Kevin Wilson,
Rif A Saurous, and D Sculley, “Automos: Learning a non-intrusive
assessor of naturalness-of-speech,” arXiv preprint arXiv:1611.09207,
2016.
[264] Milos Cernak and Milan Rusko, “An evaluation of synthetic speech
using the pesq measure,” in Proc. European Congress on Acoustics,
2005, pp. 2725–2728.
[265] Dong-Yan Huang, “Prediction of perceived sound quality of syn-
thetic speech,” Proc. APSIPA, 2011.
[266] Ulpu Remes, Reima Karhila, and Mikko Kurimo, “Objective evalu-
ation measures for speaker-adaptive hmm-tts systems, in Eighth
ISCA Workshop on Speech Synthesis, 2013.
[267] Chen-Chou Lo, Szu-Wei Fu, Wen-Chin Huang, Xin Wang, Junichi
Yamagishi, Yu Tsao, and Hsin-Min Wang, “Mosnet: Deep learning
based objective assessment for voice conversion,” arXiv preprint
arXiv:1904.08352, 2019.
[268] Jennifer Williams, Joanna Rownicka, Pilar Oplustil, and Simon King,
“Comparison of speech representations for automatic quality esti-
mation in multi-speaker text-to-speech synthesis,” arXiv preprint
arXiv:2002.12645, 2020.
[269] Tomoki Toda, Ling-Hui Chen, Daisuke Saito, Fernando Villavicencio,
Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi, “The voice
conversion challenge 2016,” in Interspeech 2016, 2016, pp. 1632–
1636.
[270] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke
Saito, Fernando Villavicencio, Tomi Kinnunen, and Zhenhua Ling,
“The voice conversion challenge 2018: Promoting development of
parallel and nonparallel methods,” in Proc. Odyssey 2018 The Speaker
and Language Recognition Workshop, 2018, pp. 195–202.
[271] Zhizheng Wu, Nicholas Evans, Tomi Kinnunen, Junichi Yamagishi,
Federico Alegre, and Haizhou Li, “Spoofing and countermeasures
for speaker verification: A survey,” Speech Communication, vol. 66,
pp. 130–153, 2015.
[272] Mirjam Wester, Zhizheng Wu, and Junichi Yamagishi, Analysis of the
voice conversion challenge 2016 evaluation results,” in Interspeech
2016, 2016, pp. 1637–1641.
[273] Kazuhiro Kobayashi, Shinnosuke Takamichi, Satoshi Nakamura, and
Tomoki Toda, “The nu-naist voice conversion system for the voice
conversion challenge 2016,” in Interspeech 2016, 2016, pp. 1667–
1671.
[274] Yichiao Wu, Patrick Lumban Tobing, Tomoki Hayashi, Kazuhiro
Kobayashi, and Tomoki Toda, “The nu non-parallel voice conversion
system for the voice conversion challenge 2018,” in Proc. Odyssey
2018 The Speaker and Language Recognition Workshop, 2018, pp.
211–218.
[275] Li-Juan Liu, Zhen-Hua Ling, Yuan Jiang, Ming Zhou, and Li-Rong
Dai, Wavenet vocoder with limited training data for voice conver-
sion,” in Proc. Interspeech 2018, 2018, pp. 1983–1987.
[276] J. Zhang, Z. Ling, L. Liu, Y. Jiang, and L. Dai, “Sequence-to-sequence
acoustic modeling for voice conversion,” IEEE/ACM Transactions on
Audio, Speech, and Language Processing, vol. 27, no. 3, pp. 631–644,
2019.
[277] J. Zhang, Z. Ling, and L. Dai, “Non-parallel sequence-to-sequence
voice conversion with disentangled linguistic and speaker represen-
tations,” IEEE/ACM Transactions on Audio, Speech, and Language
Processing, vol. 28, pp. 540–552, 2020.
[278] Zhizheng Wu, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi,
Cemal Hanilçi, Md. Sahidullah, and Aleksandr Sizov, “ASVspoof 2015:
the first automatic speaker verification spoofing and countermea-
sures challenge,” in Proc. Interspeech, 2015, pp. 2037–2041.
[279] Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilçi, M. Sahidullah, A. Sizov,
N. Evans, M. Todisco, and H. Delgado, “Asvspoof: The automatic
speaker verification spoofing and countermeasures challenge,” IEEE
Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 588–
604, 2017.
[280] Tomi Kinnunen, Md. Sahidullah, Héctor Delgado, Massimiliano
Todisco, Nicholas Evans, Junichi Yamagishi, and Kong-Aik Lee, “The
ASVspoof 2017 challenge: assessing the limits of replay spoofing
attack detection,” in Proc. Interspeech, 2017, pp. 2–6.
[281] Massimiliano Todisco, Xin Wang, Ville Vestman, Md. Sahidullah,
Héctor Delgado, Andreas Nautsch, Junichi Yamagishi, Nicholas
Evans, Tomi H. Kinnunen, and Kong Aik Lee, “ASVspoof 2019: future
horizons in spoofed and fake audio detection,” in Proc. Interspeech,
2019, pp. 1008–1012.
[282] Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Hector Delgado,
Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman,
Tomi Kinnunen, Kong Aik Lee, Lauri Juvela, Paavo Alku, Yu-Huai
Peng, Hsin-Te Hwang, Yu Tsao, Hsin-Min Wang, Sebastien Le Ma-
guer, Markus Becker, Fergus Henderson, Rob Clark, Yu Zhang, Quan
Wang, Ye Jia, Kai Onuma, Koji Mushika, Takashi Kaneda, Yuan Jiang,
Li-Juan Liu, Yi-Chiao Wu, Wen-Chin Huang, Tomoki Toda, Kou
Tanaka, Hirokazu Kameoka, Ingmar Steiner, Driss Matrouf, Jean-
Francois Bonastre, Avashna Govender, Srikanth Ronanki, Jing-Xuan
Zhang, and Zhen-Hua Ling, Asvspoof 2019: a large-scale public
database of synthetic, converted and replayed speech,” 2019.
[283] John Kominek and Alan W Black, “The cmu arctic speech databases,”
in Fifth ISCA workshop on speech synthesis, 2004.
[284] Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald, et al.,
“Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning
toolkit,” 2016.
[285] Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia,
Zhifeng Chen, and Yonghui Wu, “LibriTTS: A Corpus Derived from
LibriSpeech for Text-to-Speech,” in Proc. Interspeech 2019, 2019, pp.
1526–1530.
[286] Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman,
“Voxceleb: Large-scale speaker verification in the wild, Computer
Speech & Language, vol. 60, pp. 101027, 2020.
[287] Kazuhiro Kobayashi and Tomoki Toda, “sprocket: Open-source
voice conversion software,” in Proc. Odyssey 2018 The Speaker and
Language Recognition Workshop, 2018, pp. 203–210.
[288] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro
Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann,
Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsub-
asa Ochiai, “Espnet: End-to-end speech processing toolkit,” in Proc.
Interspeech 2018, 2018, pp. 2207–2211.
Berrak Sisman received her PhD degree in Elec-
trical and Computer Engineering from National
University of Singapore in 2020, fully funded by
A*STAR Graduate Academy under Singapore Inter-
national Graduate Award (SINGA). She is currently
an Assistant Professor at Singapore University of
Technology and Design (SUTD). She is also an
Affiliated Researcher at the National University of
Singapore (NUS). Prior to joining SUTD, she was
a Postdoctoral Research Fellow at the National
University of Singapore. She was also an exchange
PhD student at the University of Edinburgh and a visiting scholar at
The Centre for Speech Technology Research, University of Edinburgh in
2019. She was attached to the RIKEN Advanced Intelligence Project, Japan,
in 2018. Her research interests include speech information processing, ma-
chine learning, speech synthesis and voice conversion. She has published
in leading journals and conferences, including IEEE/ACM Transactions
on Audio, Speech and Language Processing, ASRU, INTERSPEECH and
ICASSP. She has served as the Local Arrangement Co-chair of IEEE ASRU
2019, Chair of Young Female Researchers Mentoring @ASRU2019, and
Chair of the INTERSPEECH Student Events in 2018 and 2019.
Junichi Yamagishi received the Ph.D. degree from
the Tokyo Institute of Technology (Tokyo Tech),
Tokyo, Japan, in 2006. He is currently a Professor
with the National Institute of Informatics, Tokyo,
Japan, and also a Senior Research Fellow with The
Centre for Speech Technology Research, The Uni-
versity of Edinburgh, Edinburgh, UK. Since 2006,
he has authored or co-authored over 250 refereed
papers in international journals and conferences.
Prof. Yamagishi was a recipient of the Tejima Prize
as the best Ph.D. thesis of Tokyo Tech in 2007. He
received the Itakura Prize from the Acoustic Society of Japan in 2010,
the Kiyasu Special Industrial Achievement Award from the Information
Processing Society of Japan in 2013, the Young Scientists’ Prize from the
Minister of Education, Science and Technology in 2014, the JSPS Prize
from the Japan Society for the Promotion of Science in 2016, and the
17th DOCOMO Mobile Science Award from the Mobile Communication
Fund, Japan in 2018. He was one of the organizers for special sessions
on Spoofing and Countermeasures for the Automatic Speaker Verification
at INTERSPEECH 2013, the 1st/2nd/3rd ASVspoof Evaluation, the Voice
Conversion Challenge 2016/2018/2020, and the VoicePrivacy Challenge
2020. He was an Associate Editor of the IEEE/ACM Transactions on Audio,
Speech, and Language Processing, a Lead Guest Editor of the IEEE Journal
of Selected Topics in Signal Processing Special Issue on Spoofing and
Countermeasures for Automatic Speaker Verification, and a member of
the IEEE Signal Processing Society Speech and Language Technical
Committee. He is now the Chairperson of the ISCA Special Interest Group:
Speech Synthesis (SynSIG), a member of the Technical Committee for the
Asia-Pacific Signal and Information Processing Association Multimedia
Security and Forensics, and a Senior Area Editor of the IEEE/ACM
Transactions on Audio, Speech, and Language Processing.
Simon King (M’95–SM’08–F’15) received the M.A.
(Cantab) and M.Phil. degrees from the University
of Cambridge, Cambridge, U.K., and the Ph.D.
degree from the University of Edinburgh, Edinburgh,
U.K. He has been with the Centre for Speech
Technology Research, University of Edinburgh,
since 1993, where he is now Professor of Speech
Processing and the Director of the Centre. His
research interests include speech synthesis, recog-
nition, and signal processing, and he has around
230 publications across these areas. He has served
on the ISCA SynSIG Board and currently co-organises the Blizzard Chal-
lenge. He has previously served on the IEEE SLTC and as an Associate
Editor of the IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE
PROCESSING, and is currently an Associate Editor of Computer Speech
and Language.
Haizhou Li (M’91-SM’01-F’14) received the B.Sc.,
M.Sc., and Ph.D. degrees in electrical and elec-
tronic engineering from South China University of
Technology, Guangzhou, China, in 1984, 1987, and
1990, respectively. Dr Li is currently a Professor at
the Department of Electrical and Computer Engi-
neering, National University of Singapore (NUS).
His research interests include automatic speech
recognition, speaker and language recognition,
and natural language processing. Prior to joining
NUS, he taught at the University of Hong Kong
(1988-1990) and South China University of Technology (1990-1994). He
was a Visiting Professor at CRIN in France (1994-1995), Research Manager
at the Apple-ISS Research Centre (1996-1998), Research Director in Lernout
& Hauspie Asia Pacific (1999-2001), Vice President in InfoTalk Corp. Ltd.
(2001-2003), and the Principal Scientist and Department Head of Human
Language Technology in the Institute for Infocomm Research, Singapore
(2003-2016). Dr Li served as the Editor-in-Chief of IEEE/ACM Transactions
on Audio, Speech and Language Processing (2015-2018), a Member of the
Editorial Board of Computer Speech and Language (2012-2018), an elected
Member of IEEE Speech and Language Processing Technical Committee
(2013-2015), the President of the International Speech Communication As-
sociation (2015-2017), the President of Asia Pacific Signal and Information
Processing Association (2015-2016), and the President of Asian Federation
of Natural Language Processing (2017-2018). He was the General Chair of
ACL 2012, INTERSPEECH 2014 and ASRU 2019. Dr Li is a Fellow of the IEEE
and the ISCA. He was a recipient of the National Infocomm Award 2002
and the President’s Technology Award 2013 in Singapore. He was named
one of the two Nokia Visiting Professors in 2009 by the Nokia Foundation,
and U Bremen Excellence Chair Professor in 2019.