A Fast Griffin–Lim Algorithm

October 2013

DOI:10.1109/WASPAA.2013.6701851

Conference: Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013 IEEE Workshop on

Authors:

Nathanaël Perraudin

ETH Zurich

Peter Balazs

Austrian Academy of Sciences (OeAW)

Peter L. Søndergaard

Austrian Academy of Sciences (OeAW)

2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 20-23, 2013, New Paltz, NY

A FAST GRIFFIN-LIM ALGORITHM

Nathanaël Perraudin¹, Peter Balazs², Peter L. Søndergaard²

¹EPFL, Switzerland, nathanael.perraudin@epfl.ch

²Acoustics Research Institute, Vienna, Austria, peter.soendergaard@oeaw.ac.at, peter.balazs@oeaw.ac.at

ABSTRACT

In this paper, we present a new algorithm to estimate a signal from its short-time Fourier transform modulus (STFTM). This algorithm is computationally simple and is obtained by accelerating the well-known Griffin-Lim algorithm (GLA). Before deriving the algorithm, we give a new interpretation of the GLA and formulate the phase recovery problem as an optimization problem. We then present experimental results in which the new algorithm is tested on various signals. It not only converges significantly faster, but also recovers the signals with a smaller error than the traditional GLA.

Index Terms—Magnitude-only reconstruction, short-time Fourier transform, phase reconstruction, time-scale modification (TSM), signal estimation, spectrogram inversion

I. INTRODUCTION

Time-frequency representations, in particular Gabor transforms [1], i.e. sampled short-time Fourier transforms (STFT), are ubiquitous in signal processing. Gabor transforms describe a signal in time and frequency simultaneously. The transformation is fast (thanks to the fast Fourier transform (FFT)) and provides a good tool for signal modification. While the squared magnitude of the STFT can be understood as the "localized time-frequency power spectrum", the phase remains a complicated object that is difficult to modify appropriately. As a consequence, most transformations of the STFT work on the magnitude or the squared magnitude (the spectrogram), leaving the phase unchanged or sometimes dropping it completely. Since the STFT is a redundant construction, the resulting coefficients usually do not form a valid spectrogram (i.e. there exists no signal having exactly this spectrogram). For instance, in adaptive filtering such as denoising, the magnitude of the STFT is often modified without any modification of the phase [2].

Furthermore, accurate reconstruction of a signal from its spectrogram is also important. This is known as the phase recovery problem: recovering a signal from only the amplitude of some measurements. In the influential paper [3], it was proven that for frames with sufficiently high redundancy, a signal can be reconstructed from the magnitude of its frame coefficients only (up to a global phase factor). Recent results [4] put the necessary redundancy at $\frac{4L-4}{L}$ for a frame for $\mathbb{C}^L$.

The notion of a valid spectrogram plays a very important role in the problem: the STFT has to satisfy a so-called "consistency criterion" [5], [6]. In fact, the set of complex STFT coefficients is a proper subset of the coefficient space, i.e. an arbitrary array of complex coefficients usually does not correspond to the STFT of any signal. As a result, modifying the magnitude of the STFT does not in general lead to a valid spectrogram.

Based on this distinction, we consider two different problems, respectively called:

• Phase recovery: constructing a signal from a valid spectrogram and no phase information.

• STFT magnitude approximation: constructing a signal from a non-valid STFT magnitude and possibly some starting phase.

Both of these problems can be solved using our algorithm, which is strongly inspired by the Griffin-Lim algorithm (GLA) [7]. The GLA (see Section IV) iteratively performs two projections. We propose to additionally take into account the difference between two successive iterates. In doing so, we lose all theoretical guarantees of convergence. However, this new structure allows more accurate and faster convergence. We also expect it to be compatible with the GLA modifications presented in [5], [8], [9].

After briefly presenting the Gabor transform, we give a new interpretation of the problem and of the GLA. This leads to a new algorithm, which we call the fast Griffin-Lim algorithm (FGLA). We then present simulation results.

It should be noted that both the GLA and the algorithm presented in this paper can be applied to any frame, not just to Gabor frames. However, for clarity and simplicity, we consider only the Gabor case in this paper.

II. GABOR THEORY

In this contribution, we consider Gabor systems $\mathcal{G}(g, a, M)$ in $\mathbb{C}^L$. All signals and windows in $\mathbb{C}^L$ are considered to have periodic boundary conditions. For $g \in \mathbb{C}^L$ and integers $a, M > 0$, we define the Gabor system

$$\mathcal{G}(g, a, M) := \left\{ g_{m,n} = g[\cdot - na]\, e^{2\pi i m \cdot / M} \right\}_{n,m}, \tag{1}$$

where $m = 0, \ldots, M-1$ is the index of the frequency channel and $n = 0, \ldots, N-1$ is the index of the time position. If $\mathcal{G}$ is also a frame [10], we refer to the system as a Gabor frame. For $x \in \mathbb{C}^L$, the corresponding Gabor transform is given by

$$(Gx)[m + nM] = \langle x, g_{m,n} \rangle = \sum_{l=0}^{L-1} x[l]\, \overline{g_{m,n}[l]}, \tag{2}$$

with the analysis operator $G$ given by the matrix

$$G[m + nM, l] := G_{g,a,M}[m + nM, l] := \overline{g_{m,n}[l]}.$$

Gabor synthesis is performed by applying the conjugate transpose of $G$ to a coefficient sequence $c \in \mathbb{C}^{MN}$. The action of the synthesis operator can be equivalently described as

$$x_{\mathrm{syn}}[l] = (G^* c)[l] = \sum_{m,n} c[m + nM]\, g[l - na]\, e^{2\pi i m l / M}. \tag{3}$$

The composition $S = G^* G$ of the analysis and synthesis operators is called the frame operator.

Reconstruction can be realized using the so-called canonical dual system, obtained by inverting $S$ and defined as

$$\gamma_{m,n} = S^{-1} g_{m,n}. \tag{4}$$

In the particular case of Gabor frames, the canonical dual system is again a Gabor frame, i.e. it equals $\mathcal{G}(\gamma_{0,0}, a, M)$. Thus we refer to $\gamma_{0,0} = S^{-1} g$ as the canonical dual window.

In this case, the synthesis operator of $\gamma_{0,0}$ coincides with the pseudo-inverse of the original analysis operator, i.e. $G^*_{\gamma,a,M} = G^\dagger$. So the inversion formula reads

$$x[l] = \sum_{m,n} \langle x, g_{m,n} \rangle\, \gamma_{m,n}[l] = G^\dagger G x[l]. \tag{5}$$
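The definitions (1) to (5) can be checked numerically on a small example. The following is a minimal sketch, not the implementation used in the paper: it builds the analysis matrix explicitly, which is only practical for short signals, and it assumes $N = L/a$ time positions, that $a$ and $M$ both divide $L$, and a periodized Gaussian window. The sizes and the helper name `gabor_matrix` are illustrative assumptions.

```python
import numpy as np

def gabor_matrix(g, a, M):
    """Analysis matrix of (2): row m + n*M holds conj(g_{m,n})."""
    L = len(g)
    N = L // a                                  # assumed: a divides L
    l = np.arange(L)
    G = np.empty((M * N, L), dtype=complex)
    for n in range(N):
        shifted = np.roll(g, n * a)             # g[l - n*a], periodic boundary
        for m in range(M):
            atom = shifted * np.exp(2j * np.pi * m * l / M)   # g_{m,n}[l]
            G[m + n * M] = np.conj(atom)
    return G

# Illustrative sizes (assumptions): redundancy M / a = 3
L, a, M = 48, 4, 12
l = np.arange(L)
g = np.exp(-np.pi * np.minimum(l, L - l) ** 2 / (a * M))  # periodized Gaussian

G = gabor_matrix(g, a, M)
S = G.conj().T @ G                              # frame operator S = G* G
G_pinv = np.linalg.pinv(G)                      # pseudo-inverse G^dagger
gamma = np.linalg.solve(S, g.astype(complex))   # canonical dual window S^{-1} g

rng = np.random.default_rng(0)
x = rng.standard_normal(L)
x_rec = G_pinv @ (G @ x)                        # inversion formula (5)
print("max reconstruction error:", np.max(np.abs(x_rec - x)))
```

In this sketch the synthesis matrix built from `gamma` agrees with `G_pinv` up to rounding, which is the statement $G^*_{\gamma,a,M} = G^\dagger$ above. For realistic signal lengths one would use fast Gabor analysis and synthesis routines instead of dense matrices.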

A particular way to modify the coefficients is to multiply them by a fixed symbol $s$:

$$\mathbf{M} x = \sum_{m,n} s_{m,n} \langle x, g_{m,n} \rangle\, \gamma_{m,n}.$$

Such operators, called multipliers, can be defined for all kinds of frames [11] and find applications in acoustics, see e.g. [12].

III. THE PROBLEM

The problem can be expressed as finding a signal $x^* \in \mathbb{R}^L$ (or more generally in $\mathbb{C}^L$) from a given set of non-negative coefficients $s$, such that the magnitude of the STFT of $x^*$, $|Gx^*|$, is as close as possible to $s$. The $\ell^2$-norm is used as the measure of closeness. Mathematically, we formulate the problem as follows:

Given a frame $G$ and real non-negative coefficients $s = |s|$, $x^*$ is the solution of

$$\underset{x \in \mathbb{R}^L}{\text{minimize}} \ \big\| |Gx| - s \big\|_2. \tag{6}$$

We call $s$ a valid STFT magnitude if there exists an $x$ such that $|Gx| = s$.

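For convenience, we define an equivalent problem with the optimization variable on the coefficient side:

$$\underset{c \in \mathbb{C}^{MN}}{\text{minimize}} \ \big\| |c| - s \big\|_2 \quad \text{s.t.} \quad \exists\, x \in \mathbb{R}^L \ |\ c = Gx. \tag{7}$$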

These definitions lead naturally to a measure of error:

$$E(x) = \frac{\big\| |Gx| - s \big\|_2}{\|s\|_2}. \tag{8}$$

For convenience, instead of (8), we use the signal-to-noise ratio of the STFT magnitude, which can be expressed as

$$\mathrm{SSNR}(x) = -10 \log_{10} \left( E(x) \right). \tag{9}$$
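As a small worked example of (8) and (9), the sketch below evaluates both measures from a magnitude array. The function names and the two-element toy vectors are illustrative assumptions. The argument `c_mag` stands for $|Gx|$.

```python
import numpy as np

def relative_error(c_mag, s):
    """E of (8): || |Gx| - s ||_2 / ||s||_2, with c_mag standing for |Gx|."""
    return np.linalg.norm(c_mag - s) / np.linalg.norm(s)

def ssnr(c_mag, s):
    """SSNR of (9): -10 log10(E)."""
    return -10.0 * np.log10(relative_error(c_mag, s))

# Toy numbers: ||s||_2 = 5 and the error vector has norm 0.5, so E = 0.1
s = np.array([3.0, 4.0])
c_mag = np.array([3.0, 3.5])
print(relative_error(c_mag, s), ssnr(c_mag, s))
```

With these numbers $E = 0.5 / 5 = 0.1$, so the SSNR is 10 dB. A perfect match gives $E = 0$ and an unbounded SSNR, so an implementation may want to guard the logarithm.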

IV. THE GRIFFIN-LIM ALGORITHM (GLA)

The GLA (named after its authors) was presented in 1984 in [7]. It aims at estimating a signal from its modified short-time Fourier transform. The GLA is a version of the double-projection algorithm originally suggested by Gerchberg and Saxton [13] for solving the phase recovery problem for the Fourier transform. The Gerchberg-Saxton algorithm works for a non-redundant system (the Fourier transform) by considering additional side constraints to make the solution unique. The GLA, on the other hand, works for redundant systems without any side constraints, where the uniqueness of the solution comes from the redundancy.

The GLA proceeds by projecting a signal iteratively onto two different sets in $\mathbb{C}^{\frac{L}{a} \times M}$, denoted by $C_1$ and $C_2$.

$C_1$ is the set of admissible points for problem (7). It is also the set of coefficients $c$ that can be reached from $x \in \mathbb{R}^L$ through the frame $G$, i.e. the range of $G$:

$$C_1 = \{ c \mid \exists\, x \in \mathbb{R}^L \ \text{s.t.}\ c = Gx \}. \tag{10}$$

This meets the hard constraint of problem (7). Note that $C_1$ is the set of coefficients that satisfy the consistency criterion [14]. Following [10], we can express the projection onto $C_1$ as

$$P_{C_1}(c) = G G^\dagger c. \tag{11}$$

Let $C_2$ be the set of coefficients minimizing (7) without necessarily satisfying the hard constraint. It is simply given by

$$C_2 = \left\{ c \in \mathbb{C}^{MN} \ \middle|\ |c| = s \right\}.$$

The projection onto $C_2$ simply amounts to forcing the magnitude of $c$ to be $s$ elementwise:

$$P_{C_2}(c) = s \cdot e^{i \angle c}. \tag{12}$$

The GLA can now be formulated (cf. Algorithm 1).

Algorithm 1 Griffin-Lim algorithm (GLA)

Fix the initial phase $\angle c_0$

Initialize $c_0 = s \cdot e^{i \angle c_0}$

Iterate for $n = 1, 2, \ldots$

$c_n = P_{C_1}(P_{C_2}(c_{n-1}))$

Until convergence

$x^* = G^\dagger c_n$
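Algorithm 1 can be sketched in a few lines once the two projections are available. The code below is an illustrative toy, not the authors' implementation: it uses a dense analysis matrix and `numpy.linalg.pinv` for $G^\dagger$, a zero initial phase, a fixed iteration count instead of a convergence test, and assumed sizes and helper names. It applies (11) exactly as written, so the estimate is complex in general. For a real signal one might keep only the real part.

```python
import numpy as np

def gabor_matrix(g, a, M):
    """Analysis matrix of (2): row m + n*M holds conj(g_{m,n})."""
    L = len(g)
    l = np.arange(L)
    rows = [np.conj(np.roll(g, n * a) * np.exp(2j * np.pi * m * l / M))
            for n in range(L // a) for m in range(M)]
    return np.array(rows)

def proj_c1(c, G, G_pinv):
    """Projection (11) onto the range of G."""
    return G @ (G_pinv @ c)

def proj_c2(c, s):
    """Projection (12): keep the phase of c, impose the magnitude s."""
    return s * np.exp(1j * np.angle(c))

def griffin_lim(s, G, G_pinv, n_iter=100):
    """Algorithm 1 with zero initial phase; returns x* and E of (8) per iteration."""
    c = s.astype(complex)                       # c_0 = s * exp(i * 0)
    errors = np.empty(n_iter)
    for n in range(n_iter):
        c = proj_c1(proj_c2(c, s), G, G_pinv)   # c_n = P_C1(P_C2(c_{n-1}))
        errors[n] = np.linalg.norm(np.abs(c) - s) / np.linalg.norm(s)
    return G_pinv @ c, errors

# Toy problem (all sizes are illustrative assumptions)
L, a, M = 48, 4, 12
l = np.arange(L)
g = np.exp(-np.pi * np.minimum(l, L - l) ** 2 / (a * M))   # periodized Gaussian
G = gabor_matrix(g, a, M)
G_pinv = np.linalg.pinv(G)

rng = np.random.default_rng(0)
s = np.abs(G @ rng.standard_normal(L))          # a valid STFT magnitude
x_est, errors = griffin_lim(s, G, G_pinv)
print(f"SSNR after {len(errors)} iterations: {-10 * np.log10(errors[-1]):.1f} dB")
```

Because each $c_n$ lies in $C_1$ and both projections are exact, the recorded error $\| |c_n| - s \|_2 / \|s\|_2$ cannot increase from one iteration to the next, which is a useful sanity check on any implementation.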

Improvements of the GLA can be found in the literature. In [5], an approximate way to perform the projection $P_{C_1}$ is proposed. As the projection operator is highly structured, it is normally applied using a fast algorithm, and this structure cannot be exploited in the approximation. We have therefore chosen not to use this approximation in this paper.

In [15], [8], the Real-Time Iterative Spectrogram Inversion (RTISI) algorithm, an extension of the GLA, was proposed. Reconstruction is performed piece by piece, again using the GLA together with a clever starting point. A further improvement is presented in [9].

In the next section, we propose a different modification of the GLA. It should be possible to combine both modifications into one algorithm; however, a detailed analysis of this is beyond the scope of this contribution.

V. THE FAST GRIFFIN-LIM ALGORITHM (FGLA)

Equations (6) and (7) define the problem in an optimization form. However, classic optimization algorithms cannot easily reach a solution since neither (6) nor (7) is convex. Phase recovery was recently expressed as a convex optimization problem in [16], [17]. However, the heavy computational cost of this approach currently makes it unsuitable for long signals (i.e. $L > 128$). In this contribution, we instead propose to search for a solution of the non-convex problem (7). In fact, we need to find the intersection of the two sets $C_1$ and $C_2$. Iterative projections would converge to an optimal solution if both sets were convex. Our idea is to make larger steps at each iteration. To do so, we use the information available from the previous iterations.

More precisely, we replace the update rule of the Griffin-Lim algorithm

$$c_n = P_{C_1}(P_{C_2}(c_{n-1})) \tag{13}$$

by

$$c_n = P_{C_1}(P_{C_2}(c_{n-1} + \alpha_n \Delta c_{n-1})), \tag{14}$$

where $\Delta c_n = c_n - c_{n-1}$. At convergence, (14) and (13) are equivalent. However, in our experiments (14) converges faster to the solution. Indeed, the term $\alpha_n \Delta c_{n-1}$ enlarges the steps depending on the values of the current iterates.
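The equivalence at convergence can be spelled out in one line. This is a short check added for illustration, not part of the original derivation. Suppose the iterates of (14) have stopped changing, so that $c_n = c_{n-1} = c_{n-2} = \bar{c}$. Then $\Delta c_{n-1} = 0$ and

$$\bar{c} = P_{C_1}\big(P_{C_2}(\bar{c} + \alpha_n \cdot 0)\big) = P_{C_1}\big(P_{C_2}(\bar{c})\big),$$

which is exactly the fixed-point condition of (13). Likewise, choosing $\alpha_n = 0$ reduces (14) to (13) at every iteration.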

A similar trick is used in the algorithm called "FISTA" (fast iterative shrinkage-thresholding algorithm) [18], which speeds up the algorithm "ISTA". For that method, the authors provide the optimal sequence of $\alpha_n$ that optimizes the convergence. In our case, the computation of such a sequence remains an open question, due to the non-convexity of our problem. Thus, in the following, we consider the simple case of $\alpha$ being a constant.

Using (14), we define Algorithm 2, called the Fast Griffin-Lim algorithm (FGLA). We observe that the heavy part of the computation takes place in the projection $P_{C_1}$, which happens only once per iteration in both algorithms. Hence we assume the computational cost per iteration to be equivalent in both algorithms. Since the projection only involves pure Gabor analysis and synthesis, efficient algorithms [19] for these operators can be used. Note that changing the update rule removes all theoretical guarantees of convergence. This is another open issue of this contribution.

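Algorithm 2 Fast Griffin-Lim algorithm (FGLA)

Fix the initial phase $\angle c_0$

Initialize $c_0 = s \cdot e^{i \angle c_0}$, $t_0 = P_{C_2}(P_{C_1}(c_0))$

Iterate for $n = 1, 2, \ldots$

$t_n = P_{C_1}(P_{C_2}(c_{n-1}))$

$c_n = t_n + \alpha_n (t_n - t_{n-1})$

Update $\alpha_n$

Until convergence

$x^* = G^\dagger c_n$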
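Algorithm 2 differs from Algorithm 1 only by the extrapolation step, which the sketch below makes explicit. As before, this is a toy illustration under assumptions: dense matrices with `numpy.linalg.pinv`, zero initial phase, a constant $\alpha$, a fixed number of iterations, and illustrative sizes and function names. It makes no claim about which variant reaches the smaller error on this toy problem.

```python
import numpy as np

def gabor_matrix(g, a, M):
    """Analysis matrix of (2): row m + n*M holds conj(g_{m,n})."""
    L = len(g)
    l = np.arange(L)
    rows = [np.conj(np.roll(g, n * a) * np.exp(2j * np.pi * m * l / M))
            for n in range(L // a) for m in range(M)]
    return np.array(rows)

def proj_c1(c, G, G_pinv):
    """Projection (11) onto the range of G."""
    return G @ (G_pinv @ c)

def proj_c2(c, s):
    """Projection (12): keep the phase of c, impose the magnitude s."""
    return s * np.exp(1j * np.angle(c))

def fast_griffin_lim(s, G, G_pinv, alpha=0.99, n_iter=50):
    """Algorithm 2 with constant alpha and zero initial phase."""
    c = s.astype(complex)                         # c_0
    t_prev = proj_c2(proj_c1(c, G, G_pinv), s)    # t_0 = P_C2(P_C1(c_0))
    for _ in range(n_iter):
        t = proj_c1(proj_c2(c, s), G, G_pinv)     # t_n = P_C1(P_C2(c_{n-1}))
        c = t + alpha * (t - t_prev)              # c_n = t_n + alpha (t_n - t_{n-1})
        t_prev = t
    return G_pinv @ c                             # x* = G^dagger c_n

def relative_error(x, s, G):
    """E of (8) for a signal estimate x."""
    return np.linalg.norm(np.abs(G @ x) - s) / np.linalg.norm(s)

# Toy problem (all sizes are illustrative assumptions)
L, a, M = 48, 4, 12
l = np.arange(L)
g = np.exp(-np.pi * np.minimum(l, L - l) ** 2 / (a * M))   # periodized Gaussian
G = gabor_matrix(g, a, M)
G_pinv = np.linalg.pinv(G)

rng = np.random.default_rng(0)
s = np.abs(G @ rng.standard_normal(L))            # a valid STFT magnitude

x_plain = fast_griffin_lim(s, G, G_pinv, alpha=0.0)    # alpha = 0: plain GLA
x_fast = fast_griffin_lim(s, G, G_pinv, alpha=0.99)
print("E with alpha = 0   :", relative_error(x_plain, s, G))
print("E with alpha = 0.99:", relative_error(x_fast, s, G))
```

Setting `alpha=0.0` makes the extrapolation vanish, so the loop reproduces the GLA iterates exactly. This gives a simple way to compare both algorithms with a single code path.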

VI. NUMERICAL RESULTS

In this section, we present three different phase reconstruction experiments. We remind the reader that, as the problem is not convex, the algorithms will most likely converge to a local minimum that depends on the starting point.

Experiments were done using classical windows. Those presented in this paper use a Nuttall (Figure 1), a Gaussian (Figure 2) and a Hann window (Figure 3). We choose the frame parameters $a = 32$ and $M = 256$. This gives a redundancy of 8, which ensures that all the information lies in the spectrogram [3].
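The redundancy value can be verified directly. Assuming $N = L/a$ time positions (an assumption consistent with the coefficient space $\mathbb{C}^{\frac{L}{a} \times M}$ used in Section IV), the number of coefficients per signal sample is

$$\frac{MN}{L} = \frac{M}{a} = \frac{256}{32} = 8.$$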

Using different parameters or windows leads to similar results. A reproducible research addendum can be downloaded at http://unlocbox.sourceforge.net/rr/fgla/. In this archive, the parameters can easily be changed and other configurations tested.

In the first example, we aim at finding a signal from its spectrogram (phase reconstruction). In this specific case, we know that such a signal exists. The initial phase is simply set to zero and the number of iterations for both algorithms is fixed to 100000. In Figure 1, we observe that the FGLA not only converges faster (better average slopes), but also reaches points with smaller error. Note that for the signal 'bat', the new algorithm was able to perform perfect reconstruction. This signal is very short, only 400 samples. We also observe that, using the FGLA, the SSNR is not strictly increasing from one iteration to the next. However, on average, the SSNR is increasing.

In the second example, we start from a signal, compute the Gabor coefficients, apply a spectrogram multiplier and reconstruct a new signal as well as we can. In that case, signals fitting the modified STFTM exactly usually do not exist. As a consequence, we look for the signal with the best spectrogram approximation. The applied multiplier is random. This multiplier is chosen because it modifies the spectrogram significantly and, in that case, the algorithms usually need more iterations to converge. Other multipliers give similar results. This time, the initial phase is not set to zero as in the previous experiment; instead, we keep the original phase of the STFT. We fixed the maximum number of iterations to 10000 as well. Generally, the new algorithm converges faster. The SSNR is sometimes improved, but not very significantly.

In the third and last experiment, we analyze the effect of $\alpha$ on the FGLA. Figure 3 displays tests for various constants $\alpha$. $\alpha = 1$ seems to be the limit of stability of the algorithm, while $\alpha = 0$ corresponds to the Griffin-Lim algorithm. Increasing $\alpha$ leads to better results, with an optimal value near, but not larger than, 1. As a consequence, $\alpha = 0.99$ has been chosen for the other experiments.

Figure 1. Phase recovery problem: SSNR through iterations for the GLA and the FGLA.

Figure 2. STFT magnitude optimization problem: SSNR through iterations for the GLA and the FGLA.

The algorithm presented in this paper has been incorporated as an option for the frsynabs function in the LTFAT toolbox [20].

Figure 3. Influence of the parameter $\alpha$ on the FGLA.

VII. CONCLUSION

In this paper, we have presented the phase recovery problem in the form of an optimization problem. This approach allows us to give a new interpretation of the GLA, which in turn lets us speed it up. We proposed an algorithm (FGLA) that is indeed faster and also seems to converge to better points. However, any theoretical guarantee of convergence has been lost in the process. In practice, our algorithm can replace the GLA at a very low cost of implementation and computation. In our further research, we will look for a convergence proof and an optimal sequence of $\alpha_n$, and possibly merge our algorithm with the RTISI real-time GLA algorithm.

Acknowledgment

This work was supported by the Austrian Science Fund (FWF) START-project FLAME (“Frames and Linear Operators for Acoustical Modeling and Parameter Estimation”; Y 551-N13).

VIII. REFERENCES

[1] H. G. Feichtinger and T. Strohmer, Eds., Gabor Analysis and Algorithms, Boston, 1998.

[2] P. Majdak, P. Balazs, W. Kreuzer, and M. Dörfler, “A time-frequency method for increasing the signal-to-noise ratio in system identification with exponential sweeps,” in Proceedings of the 36th International Conference on Acoustics, Speech and Signal Processing, ICASSP 2011, Prague, 2011.

[3] R. Balan, P. Casazza, and D. Edidin, “On signal reconstruction without phase,” Applied and Computational Harmonic Analysis, vol. 20, no. 3, pp. 345–356, 2006.

[4] B. G. Bodmann and N. Hammen, “Stable phase retrieval with low-redundancy frames,” arXiv preprint arXiv:1302.5487, 2013.

[5] J. Le Roux, H. Kameoka, N. Ono, and S. Sagayama, “Fast signal reconstruction from magnitude STFT spectrogram based on spectrogram consistency,” in Proc. 13th International Conference on Digital Audio Effects (DAFx-10), 2010, pp. 397–403.

[6] J. Le Roux and E. Vincent, “Consistent Wiener filtering for audio source separation,” Signal Processing Letters, IEEE, vol. 20, no. 3, pp. 217–220, 2013.

[7] D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform,” Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 32, no. 2, pp. 236–243, 1984.

[8] X. Zhu, G. Beauregard, and L. Wyse, “Real-time signal estimation from modified short-time Fourier transform magnitude spectra,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 15, no. 5, pp. 1645–1653, 2007.

[9] V. Gnann and M. Spiertz, “Improving RTISI Phase Estimation With Energy Order and Phase Unwrapping,” in Proc. of International Conference on Digital Audio Effects DAFx, vol. 10, 2010.

[10] O. Christensen, An Introduction to Frames and Riesz Bases. Birkhäuser, 2003.

[11] P. Balazs, “Basic definition and properties of Bessel multipliers,” Journal of Mathematical Analysis and Applications, vol. 325, no. 1, pp. 571–585, January 2007. [Online]. Available: http://dx.doi.org/10.1016/j.jmaa.2006.02.012

[12] P. Balazs, B. Laback, G. Eckel, and W. A. Deutsch, “Time-frequency sparsity by removing perceptually irrelevant components using a simple model of simultaneous masking,” IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 1, pp. 34–49, 2010. [Online]. Available: http://www.kfs.oeaw.ac.at/xxl/mask/mask.pdf

[13] R. W. Gerchberg and W. O. Saxton, “A practical algorithm for the determination of the phase from image and diffraction plane pictures,” Optik, vol. 35, no. 2, pp. 237–250, 1972.

[14] J. Le Roux, N. Ono, and S. Sagayama, “Explicit consistency constraints for STFT spectrograms and their application to phase reconstruction,” Proc. SAPA, pp. 23–28, 2008.

[15] G. T. Beauregard, X. Zhu, and L. Wyse, “An efficient algorithm for real-time spectrogram inversion,” in Proceedings of the 8th International Conference on Digital Audio Effects, 2005, pp. 116–118.

[16] E. J. Candes, T. Strohmer, and V. Voroninski, “PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming,” Communications on Pure and Applied Mathematics, 2012.

[17] D. L. Sun and J. O. Smith III, “Estimating a signal from a magnitude spectrogram via convex optimization,” arXiv preprint arXiv:1209.2076, 2012.

[18] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.

[19] P. L. Søndergaard, “Efficient Algorithms for the Discrete Gabor Transform with a long FIR window,” J. Fourier Anal. Appl., vol. 18, no. 3, pp. 456–470, 2012.

[20] P. L. Søndergaard, B. Torrésani, and P. Balazs, “The Linear Time Frequency Analysis Toolbox,” International Journal of Wavelets, Multiresolution Analysis and Information Processing, vol. 10, no. 4, 2012.

Images that Sound: Composing Images and Sounds on a Single Canvas

Preprint

May 2024

Spectrograms are 2D representations of sound that look very different from the images found in our visual world. And natural images, when played as spectrograms, make unnatural sounds. In this paper, we show that it is possible to synthesize spectrograms that simultaneously look like natural images and sound like natural audio. We call these spectrograms images that sound. Our approach is simple and zero-shot, and it leverages pre-trained text-to-image and text-to-spectrogram diffusion models that operate in a shared latent space. During the reverse process, we denoise noisy latents with both the audio and image diffusion models in parallel, resulting in a sample that is likely under both models. Through quantitative evaluations and perceptual studies, we find that our method successfully generates spectrograms that align with a desired audio prompt while also taking the visual appearance of a desired image prompt. Please see our project page for video results: https://ificl.github.io/images-that-sound/

NeRAF: 3D Scene Infused Neural Radiance and Acoustic Fields

Preprint

Full-text available

May 2024

Sound plays a major role in human perception, providing essential scene information alongside vision for understanding our environment. Despite progress in neural implicit representations, learning acoustics that match a visual scene is still challenging. We propose NeRAF, a method that jointly learns acoustic and radiance fields. NeRAF is designed as a Nerfstudio module for convenient access to realistic audio-visual generation. It synthesizes both novel views and spatialized audio at new positions, leveraging radiance field capabilities to condition the acoustic field with 3D scene information. At inference, each modality can be rendered independently and at spatially separated positions, providing greater versatility. We demonstrate the advantages of our method on the SoundSpaces dataset. NeRAF achieves substantial performance improvements over previous works while being more data-efficient. Furthermore, NeRAF enhances novel view synthesis of complex scenes trained with sparse data through cross-modal learning.

The DKU Speech Synthesis System for 2019 Blizzard Challenge

Conference Paper

Sep 2019

High-resolution ground motion generation with time–frequency representation

Article

Full-text available

May 2024
B EARTHQ ENG

Data-driven deep learning application in earthquake engineering highlights the insufficient quantity and the imbalanced feature distribution of measured ground motions, which can be mitigated with artificial ones. Traditional ground motion generation techniques tend to extend the catalogs conditioning on existing records, while current deep learning-based methods such as generative adversarial networks (GANs) only provide limited duration or sampling rate, thus obstructing further applications. In this paper, an invertible time–frequency transformation process is employed, based on which the transformed earthquake representation is implicitly modeled by advanced GANs for high-resolution and unconditional generation. Moreover, leveraging the disentangling property of the GAN’s latent space, the newly developed latent space walking method is adopted to assure the generations with controllable time–frequency features. A feature-balanced generated ground motion dataset has been constructed in combination with the proposed methods, and the application potential was demonstrated through comparative experiments of different datasets.

A Flexible Online Framework for Projection-Based Stft Phase Retrieval

Conference Paper

Apr 2024

Robust Spoof Speech Detection Based on Multi-Scale Feature Aggregation and Dynamic Convolution

Conference Paper

Apr 2024

Live Iterative Ptychography with Projection-Based Algorithms

Conference Paper

Apr 2024

GLA-GRAD: A Griffin-Lim Extended Waveform Generation Diffusion Model

Conference Paper

Apr 2024

DSUSING: Dual Scale U-Nets for Singing Voice Synthesis

Conference Paper

Feb 2024

Low-Latency Neural Speech Phase Prediction Based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks

Article

Jan 2024

This paper presents a novel neural speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is a core module for direct wrapped phase prediction. This architecture consists of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating the phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses defined between the predicted wrapped phase spectra and natural ones by activating the instantaneous phase error, group delay error and instantaneous angular frequency error using an anti-wrapping function. We mathematically demonstrate that the anti-wrapping function should possess three properties, namely parity, periodicity and monotonicity. We also achieve low-latency streamable phase prediction by combining causal convolutions and knowledge distillation training strategies. For both analysis-synthesis and specific speech generation tasks, experimental results show that our proposed neural speech phase prediction model outperforms the iterative phase estimation algorithms and neural network-based phase prediction methods in terms of phase prediction precision, efficiency and robustness. Compared with HiFi-GAN-based waveform reconstruction method, our proposed model also shows outstanding efficiency advantages while ensuring the quality of synthesized speech.

IMPROVING RTISI PHASE ESTIMATION WITH ENERGY ORDER AND PHASE UNWRAPPING

Article

Full-text available

Sep 2010

This paper presents two ways to improve the Real-Time Iterative Spectrogram Inversion (RTISI) algorithm. The standard RTISI phase estimator with look-ahead processes the buffered frames in reverse order. We show that better results are achieved by control-ling this order according to frame energy. Another improvement is to initialize the last row of the phase estimator buffer by progress-ing the unwrapped phase difference of the previous frames. Fur-thermore, we extend these improvements to dual window length phase estimation and analyze the performance in SER with respect to different analysis window lengths.

An Efficient Algorithm for Real-Time Spectrogram Inversion

Conference Paper

Full-text available

Sep 2005

We present a computationally efficient real-time algorithm for constructing audio signals from spectrograms. Spectrograms consist of a time sequence of short time Fourier transform magni- tude (STFTM) spectra. During the audio signal construction proc- ess, phases are derived for the individual frequency components so that the spectrogram of the constructed signal is as close as possi- ble to the target spectrogram given real-time constraints. The algorithm is a variation of the classic Griffin and Lim [1] tech- nique modified to be computable in real-time. We discuss the application of the algorithm to time-scale modification of audio signals such as speech and music, and performance is compared with [three] other methods. The new algorithm generates compa- rable or better results with significantly less computation. The phase consistency between adjacent frames produces excellent subjective sound quality with minimal fame transition artifacts.

Consistent Wiener Filtering for Audio Source Separation

Article

Full-text available

Mar 2013

Wiener filtering is one of the most ubiquitous tools in signal processing, in particular for signal denoising and source separation. In the context of audio, it is typically applied in the time-frequency domain by means of the short-time Fourier transform (STFT). Such processing does generally not take into account the relationship between STFT coefficients in different time-frequency bins due to the redundancy of the STFT, which we refer to as consistency. We propose to enforce this relationship in the design of the Wiener filter, either as a hard constraint or as a soft penalty. We derive two conjugate gradient algorithms for the computation of the filter coefficients and show improved audio source separation performance compared to the classical Wiener filter both in oracle and in blind conditions.

Efficient Algorithms for the Discrete Gabor Transform with a Long Fir Window

Article

Full-text available

Jun 2012

Peter L. Søndergaard

The Discrete Gabor Transform (DGT) is the most commonly used signal transform for signal analysis and synthesis using a linear frequency scale. The development of the Linear Time-Frequency Analysis Toolbox (LTFAT) has been based on a detailed study of many variants of the relevant algorithms. As a side result of these systematic developments of the subject, two new methods are presented here. Comparisons are made with respect to the computational complexity, and the running time of optimised implementations in the C programming language. The new algorithms have the lowest known computational complexity and running time when a long FIR window is used. The implementations are freely available for download. By summarizing general background information on the state of the art, this article can also be seen as a research survey, sharing with the readers experience in the numerical work in Gabor analysis.

The Linear Time Frequency Analysis Toolbox

Article

Full-text available

Jul 2012

The Linear Time Frequency Analysis Toolbox is a MATLAB/Octave toolbox for computational time-frequency analysis. It is intended both as an educational and computational tool. The toolbox provides the basic Gabor, Wilson and MDCT transform along with routines for constructing windows (filter prototypes) and routines for manipulating coefficients. It also provides a bunch of demo scripts devoted either to demonstrating the main functions of the toolbox, or to exemplify their use in specific signal processing applications. In this paper we describe the used algorithms, their mathematical background as well as some signal processing applications.

Explicit Consistency Constraints for STFT Spectrograms and their Application to Phase Reconstruction

Article

Jan 2008

As many acoustic signal processing methods, for example for source separation or noise canceling, operate in the magnitude spectrogram domain, the problem of reconstructing a perceptually good sounding signal from a modified magnitude spectrogram, and more generally to understand what makes a spectrogram consistent, is very important. In this article, we derive the constraints which a set of complex numbers must verify to be a consistent STFT spectrogram, i.e. to be the STFT spectrogram of a real signal, and describe how they lead to an objective function measuring the consistency of a set of complex numbers as a spectrogram. We then present a flexible phase reconstruction algorithm based on a local approximation of the consistency constraints, explain its relation with phase-coherence conditions devised as necessary for a good perceptual sound quality, and derive a real-time time scale modification algorithm based on sliding-block analysis. Finally, we show how inconsistency can be used to develop a spectrogram-based audio encryption scheme.

Fast Signal Reconstruction from Magnitude STFT Spectrogram based on Spectrogram Consistency

Article

The modification of magnitude spectrograms is at the core of many audio signal processing methods, from source separation to sound modification or noise canceling, and reconstructing a natural sounding signal in such situations is thus a very important issue. This article presents recent theoretical and experimental developments on the application to signal reconstruction from a modified magnitude spectrogram of the constraints that an array of complex numbers must verify to be a consistent short-time Fourier transform (STFT) spectrogram, i.e., to be the STFT spectrogram of an actual real-valued signal. We give here further theoretical insights, present several potential variations on our previously introduced algorithm, investigate various techniques to speed up the signal reconstruction process, and present a thorough experimental comparison of the performance of all the considered algorithms.

Estimating a Signal from a Magnitude Spectrogram via Convex Optimization

Article

Sep 2012

The problem of recovering a signal from the magnitude of its short-time Fourier transform (STFT) is a longstanding one in audio signal processing. Existing approaches rely on heuristics that often perform poorly because of the nonconvexity of the problem. We introduce a formulation of the problem that lends itself to a tractable convex program. We observe that our method yields better reconstructions than the standard Griffin-Lim algorithm. We provide an algorithm and discuss practical implementation details, including how the method can be scaled up to larger examples.

Stable phase retrieval with low-redundancy frames

Article

Feb 2013

We investigate the recovery of vectors from magnitudes of frame coefficients when the frames have a low redundancy, meaning a small number of frame vectors compared to the dimension of the Hilbert space. We first show that for vectors in d dimensions, 4d-4 suitably chosen frame vectors are sufficient to uniquely determine each signal, up to an overall unimodular constant, from the magnitudes of its frame coefficients. Then we discuss the effect of noise and show that 8d-4 frame vectors provide a stable recovery if part of the frame coefficients is bounded away from zero. In this regime, perturbing the magnitudes of the frame coefficients by noise that is sufficiently small results in a recovery error that is at most proportional to the noise level.

Signal estimation from modified short-time Fourier transform

Conference Paper

May 1983
IEEE Trans Acoust Speech Signal Process

In this paper, we present an algorithm to estimate a signal from its modified short-time Fourier transform (STFT). This algorithm is computationally simple and is obtained by minimizing the mean squared error between the STFT of the estimated signal and the modified STFT. Using this algorithm, we also develop an iterative algorithm to estimate a signal from its modified STFT magnitude. The iterative algorithm is shown to decrease, in each iteration, the mean squared error between the STFT magnitude of the estimated signal and the modified STFT magnitude. The major computation involved in the iterative algorithm is the discrete Fourier transform (DFT) computation, and the algorithm appears to be real-time implementable with current hardware technology. The algorithm developed in this paper has been applied to the time-scale modification of speech. The resulting system generates very high-quality speech, and appears to be better in performance than any existing method.
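
The iteration summarized in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the circular framing, the periodic Hann window, the random starting signal, and the helper names `stft`, `istft` and `griffin_lim` are all assumptions made for the example. The synthesis step is the weighted overlap-add that minimizes the squared error between the STFT of the estimate and the modified STFT, and the loop records the magnitude error that the abstract states does not increase from one iteration to the next.

```python
import numpy as np

def _frame_index(L, N, hop):
    # Circular frame positions: frame m covers samples (m*hop + n) mod L.
    return (np.arange(0, L, hop)[:, None] + np.arange(N)[None, :]) % L

def stft(x, win, hop):
    # Windowed frames followed by a full-length DFT per frame.
    idx = _frame_index(len(x), len(win), hop)
    return np.fft.fft(x[idx] * win, axis=1)

def istft(C, win, hop, L):
    # Least-squares synthesis: windowed overlap-add divided by the
    # overlap-added squared window (assumed strictly positive).
    idx = _frame_index(L, len(win), hop)
    frames = np.fft.ifft(C, axis=1).real * win
    num = np.zeros(L)
    den = np.zeros(L)
    np.add.at(num, idx, frames)
    np.add.at(den, idx, win ** 2)
    return num / den

def griffin_lim(mag, win, hop, L, n_iter=50, seed=0):
    # Assumption: start from white noise; other initializations are possible.
    x = np.random.default_rng(seed).standard_normal(L)
    errors = []
    for _ in range(n_iter):
        C = stft(x, win, hop)
        # Squared error between the current and the target STFT magnitude.
        errors.append(float(np.sum((np.abs(C) - mag) ** 2)))
        # Keep the current phase, impose the target magnitude, resynthesize.
        x = istft(mag * np.exp(1j * np.angle(C)), win, hop, L)
    return x, errors

if __name__ == "__main__":
    L, N, hop = 256, 64, 16
    t = np.arange(L)
    target = np.sin(2 * np.pi * 5 * t / L) + 0.5 * np.sin(2 * np.pi * 20 * t / L)
    win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N)  # periodic Hann
    mag = np.abs(stft(target, win, hop))
    estimate, errors = griffin_lim(mag, win, hop, L)
    print(errors[0], errors[-1])
```

With a hop of a quarter of the window length, every sample is covered by four frames, so the denominator in `istft` never vanishes. The recovered signal can differ from the original by a global sign, since the magnitude carries no such information.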
