2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics October 20-23, 2013, New Paltz, NY
A FAST GRIFFIN-LIM ALGORITHM
Nathanaël Perraudin1, Peter Balazs2, Peter L. Søndergaard2
1EPFL, Switzerland, nathanael.perraudin@epfl.ch
2Acoustics Research Institute, Vienna, Austria
peter.soendergaard@oeaw.ac.at, peter.balazs@oeaw.ac.at
ABSTRACT
In this paper, we present a new algorithm to estimate a signal from
its short-time Fourier transform modulus (STFTM). This algorithm
is computationally simple and is obtained by an acceleration of
the well-known Griffin-Lim algorithm (GLA). Before deriving
the algorithm, we will give a new interpretation of the GLA
and formulate the phase recovery problem in an optimization
form. We then present some experimental results where the new
algorithm is tested on various signals. It shows not only significant
improvement in speed of convergence but it does as well recover
the signals with a smaller error than the traditional GLA.
Index Terms: Magnitude-only reconstruction, Short-time
Fourier transform, Phase reconstruction, time-scale modification
(TSM), signal estimation, spectrogram inversion
I. INTRODUCTION
Time-frequency representations, in particular Gabor transforms
[1], i.e. the sampled Short-Time Fourier transforms (STFT), are
ubiquitous in signal processing. Gabor transforms describe a signal
in time and frequency simultaneously. This transformation is fast
(thanks to the Fast Fourier transform (FFT)) and provides a good
tool for signal modification. If the magnitude squared of the STFT
is understood to be the "localized time-frequency power spectrum",
the phase remains a complicated object which is difficult to modify
appropriately. As a consequence, most of the transformations on
the STFT work with the magnitude or the magnitude squared
(the spectrogram), leaving the phase unchanged or sometimes
dropping it completely. Since the STFT is a redundant construction,
the obtained coefficients usually do not form a valid spectrogram
(i.e. there exists no signal having exactly this spectrogram).
For instance, in the case of adaptive filtering like denoising, the
magnitude of the STFT is often modified without any modification
of the phase [2].
Furthermore, accurate reconstruction of a signal from its spec-
trogram is also important. This is known as the phase recovery
problem: recovering a signal from the amplitude of some mea-
surements, only. In the influential paper [3], it was proven that
for frames with sufficiently high redundancy, a signal can be
reconstructed from the magnitude of its frame coefficients only
(up to a global phase factor). Recent results [4] put the necessary
redundancy at $(4L-4)/L$, i.e. $4L-4$ frame vectors, for a frame
for $\mathbb{C}^L$.
The notion of valid spectrogram plays a very important role
in the problem: the STFT has to verify a so-called "consistency
criterion" [5], [6]. In fact, the set of complex STFT coefficients
is a proper subset of the coefficient space, i.e. an arbitrary array
of complex coefficients usually does not correspond to the STFT of
a signal. As a result, modifying the magnitude of the STFT does
not in general lead to a valid spectrogram.
Because of this distinction, we consider two different problems:
• Phase recovery: constructing a signal from a valid spectrogram
  and no phase information.
• STFT magnitude approximation: constructing a signal from
  a non-valid STFT magnitude and possibly some starting
  phase.
Both of these problems can be solved using our algorithm, which
is strongly inspired by the Griffin-Lim algorithm (GLA) [7]. The
GLA (see section IV) iteratively performs two projections. We
propose to additionally use the difference between two successive
iterates. Doing so, we lose all theoretical guarantees of convergence.
However, in our experiments this new structure converges faster
and to more accurate solutions. We also expect it to be compatible
with the GLA modifications presented in [5], [8], [9].
After briefly presenting the Gabor transform, we will give a
new interpretation of the problem and the GLA. This will lead to
a new algorithm, which we call the fast Griffin-Lim
algorithm (FGLA). We will then present simulation results.
It should be noted that both the GLA and the algorithm
presented in this paper can be applied to any frame, and not just
to Gabor frames. However, for clarity and simplicity, we shall
consider only the Gabor case in this paper.
II. GABOR THEORY
In this contribution, we consider Gabor systems $\mathcal{G}(g, a, M)$
in $\mathbb{C}^L$. All signals and windows on $\mathbb{C}^L$ are considered to have
periodic boundary conditions. For $g \in \mathbb{C}^L$ and integers $a, M > 0$,
we define the Gabor system
$$\mathcal{G}(g, a, M) := \left\{ g_{m,n} = g[\cdot - na]\, e^{2\pi i m \cdot / M} \right\}_{n,m}, \tag{1}$$
where $m = 0, \ldots, M-1$ is the index of the frequency channel
and $n = 0, \ldots, N-1$ is the index of the time position. If $\mathcal{G}$ is
also a frame [10], we refer to the system as a Gabor frame. For
$x \in \mathbb{C}^L$, the corresponding Gabor transform is given by
$$(Gx)[m + nM] = \langle x, g_{m,n} \rangle = \sum_{l=0}^{L-1} x[l]\, \overline{g_{m,n}[l]}, \tag{2}$$
with the analysis operator $G$ given by the matrix
$$G[m + nM, l] := G_{g,a,M}[m + nM, l] := \overline{g_{m,n}[l]}.$$
Gabor synthesis is performed by applying the conjugate transpose
$G^*$ of $G$ to a coefficient sequence $c \in \mathbb{C}^{MN}$. The action of the
synthesis operator can be equivalently described as
$$x_{\mathrm{syn}}[l] = (G^* c)[l] = \sum_{m,n} c[m + nM]\, g[l - na]\, e^{2\pi i m l / M}. \tag{3}$$
The concatenation $S = G^* G$ of analysis and synthesis operators
is called the frame operator.
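As a rough illustration of (1)-(3), the following NumPy sketch builds the analysis matrix $G$ explicitly for a small periodic Gabor system. The function name `gabor_matrix`, the Gaussian window and the sizes L = 24, a = 2, M = 8 are assumptions chosen for readability and are not taken from the paper. An explicit matrix is only practical for very small L; practical implementations rely on FFT-based algorithms instead.

```python
import numpy as np

def gabor_matrix(g, a, M):
    """Analysis matrix G of the Gabor system G(g, a, M) (hypothetical helper).

    Row m + n*M holds the conjugated atom
    g_{m,n}[l] = g[l - n*a] * exp(2j*pi*m*l/M),
    so that (G @ x)[m + n*M] = <x, g_{m,n}>.
    """
    L = len(g)
    N = L // a                      # number of time positions (assumes a divides L)
    l = np.arange(L)
    G = np.empty((M * N, L), dtype=complex)
    for n in range(N):
        shifted = np.roll(g, n * a)             # periodic shift: g[l - n*a]
        for m in range(M):
            atom = shifted * np.exp(2j * np.pi * m * l / M)
            G[m + n * M] = np.conj(atom)
    return G

# Illustrative sizes (assumptions): redundancy M / a = 4.
L, a, M = 24, 2, 8
d = np.minimum(np.arange(L), L - np.arange(L))  # circular distance to index 0
g = np.exp(-np.pi * (d / 4.0) ** 2)             # periodic Gaussian window

G = gabor_matrix(g, a, M)
rng = np.random.default_rng(0)
x = rng.standard_normal(L)

c = G @ x                    # Gabor analysis, equation (2)
x_syn = G.conj().T @ c       # Gabor synthesis with the same window, equation (3)
S = G.conj().T @ G           # frame operator S = G* G
```

Note that `x_syn` equals `S @ x` rather than `x`: synthesis with the analysis window itself does not invert the transform unless the frame operator is the identity.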
Reconstruction can be realized using the so-called canonical
dual system, obtained by inverting $S$ and defined as
$$\gamma_{m,n} = S^{-1} g_{m,n}. \tag{4}$$
In the particular case of Gabor frames, the canonical dual system
is again a Gabor frame, i.e. it equals $\mathcal{G}(\gamma_{0,0}, a, M)$. Thus we refer
to $\gamma_{0,0} = S^{-1} g$ as the canonical dual window.
In this case the synthesis operator of $\gamma_{0,0}$ coincides
with the pseudo-inverse of the original analysis operator, i.e.
$G^*_{\gamma,a,M} = G^\dagger$. So the inversion formula reads
$$x[l] = \sum_{m,n} \langle x, g_{m,n} \rangle\, \gamma_{m,n}[l] = (G^\dagger G x)[l]. \tag{5}$$
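The relation between the canonical dual window and the pseudo-inverse can be checked numerically on a small example. The sketch below is an illustration under assumed sizes (L = 24, a = 2, M = 8) and an assumed Gaussian window; `gabor_matrix` is a hypothetical helper that builds the analysis matrix explicitly.

```python
import numpy as np

def gabor_matrix(g, a, M):
    """Explicit analysis matrix: row m + n*M is the conjugate of
    g[l - n*a] * exp(2j*pi*m*l/M) (hypothetical helper, small L only)."""
    L = len(g)
    N = L // a                                  # assumes a divides L
    l = np.arange(L)
    G = np.empty((M * N, L), dtype=complex)
    for n in range(N):
        shifted = np.roll(g, n * a)             # periodic shift g[l - n*a]
        for m in range(M):
            G[m + n * M] = np.conj(shifted * np.exp(2j * np.pi * m * l / M))
    return G

L, a, M = 24, 2, 8                              # illustrative sizes (assumptions)
d = np.minimum(np.arange(L), L - np.arange(L))
g = np.exp(-np.pi * (d / 4.0) ** 2)             # periodic Gaussian window

G = gabor_matrix(g, a, M)
S = G.conj().T @ G                              # frame operator
gamma = np.linalg.solve(S, g)                   # canonical dual window S^{-1} g
G_dual = gabor_matrix(gamma, a, M)              # analysis matrix of G(gamma, a, M)

rng = np.random.default_rng(1)
x = rng.standard_normal(L)
x_rec = G_dual.conj().T @ (G @ x)               # inversion formula (5)
```

For this example, the synthesis matrix of the dual system, `G_dual.conj().T`, should agree with `np.linalg.pinv(G)` up to rounding.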
A particular way to modify the coefficients is by multiplication
by a fixed symbol $s$:
$$\mathbf{M} x = \sum_{m,n} s_{m,n}\, \langle x, g_{m,n} \rangle\, \gamma_{m,n}.$$
Such operators, called multipliers, can be defined for all kinds of
frames [11], and find applications in acoustics, see e.g. [12].
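A multiplier can be sketched in a few lines once the analysis operator is available as a matrix. In the example below, a random tall complex matrix stands in for a generic frame, and the name `frame_multiplier` is hypothetical; the pseudo-inverse plays the role of synthesis with the canonical dual.

```python
import numpy as np

def frame_multiplier(G, symbol, x):
    """Analysis with G, pointwise multiplication by the symbol, then
    synthesis with the canonical dual (pseudo-inverse of G)."""
    return np.linalg.pinv(G) @ (symbol * (G @ x))

rng = np.random.default_rng(0)
L, K = 8, 32                     # signal length and number of coefficients (assumed)
# Stand-in frame (assumption): a random K x L matrix has full column rank
# with probability one, so it is the analysis operator of a frame.
G = rng.standard_normal((K, L)) + 1j * rng.standard_normal((K, L))
x = rng.standard_normal(L)

y_identity = frame_multiplier(G, np.ones(K), x)      # constant symbol 1
y_scaled = frame_multiplier(G, 0.5 * np.ones(K), x)  # constant symbol 0.5
```

A constant symbol simply rescales the signal, which gives a quick sanity check before using non-trivial symbols.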
III. THE PROBLEM
The problem can be expressed as finding a signal $x \in \mathbb{R}^L$ (or
more generally in $\mathbb{C}^L$) from a given set of non-negative
coefficients $s$, such that the magnitude of the STFT of $x$, $|Gx|$,
is as close as possible to $s$. The $\ell^2$-norm will be used as the measure
of closeness. Mathematically, we formulate the problem in the
following form:
Given a frame $G$ and real positive coefficients $s = |s|$, $x^*$ is
the solution of
$$\operatorname*{minimize}_{x \in \mathbb{R}^L} \; \big\| |Gx| - s \big\|_2. \tag{6}$$
We will call $s$ a valid STFT magnitude if there exists an $x$ such
that $|Gx| = s$.
For convenience, we define an equivalent problem with the
optimization variable on the coefficient side:
$$\operatorname*{minimize}_{c \in \mathbb{C}^{MN}} \; \big\| |c| - s \big\|_2 \quad \text{s.t.} \quad \exists\, x \in \mathbb{R}^L \mid c = Gx. \tag{7}$$
These definitions lead naturally to a measure of error:
$$E(x) = \frac{\big\| |Gx| - s \big\|_2}{\|s\|_2}. \tag{8}$$
For convenience, instead of (8), we use the signal-to-noise ratio
of the STFT magnitude, which can be expressed as
$$\mathrm{SSNR}(x) = -10 \log_{10}\big(E(x)\big). \tag{9}$$
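The error measure and the SSNR translate directly into code. The sketch below uses a random matrix as a stand-in frame (an assumption) and hypothetical function names. The sign of the logarithm is chosen so that a smaller error gives a larger SSNR, which matches how the SSNR is discussed in the numerical results.

```python
import numpy as np

def relative_error(G, x, s):
    """E(x) = || |Gx| - s ||_2 / ||s||_2."""
    return np.linalg.norm(np.abs(G @ x) - s) / np.linalg.norm(s)

def ssnr(G, x, s):
    """SSNR in dB; larger values mean a better magnitude fit."""
    return -10.0 * np.log10(relative_error(G, x, s))

rng = np.random.default_rng(0)
L, K = 8, 32                     # illustrative sizes (assumptions)
G = rng.standard_normal((K, L)) + 1j * rng.standard_normal((K, L))
x = rng.standard_normal(L)
s = np.abs(G @ x)                # a valid magnitude by construction

# Scaling the signal by 0.9 gives |G(0.9 x)| = 0.9 s, hence E = 0.1,
# i.e. an SSNR of about 10 dB.
e_scaled = relative_error(G, 0.9 * x, s)
ssnr_scaled = ssnr(G, 0.9 * x, s)

# A global sign (or phase) change leaves the magnitude untouched.
e_flipped = relative_error(G, -x, s)
```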
IV. THE GRIFFIN-LIM ALGORITHM (GLA)
The GLA (named after its authors) was presented in 1984
in [7]. It aims at estimating a signal from its modified short-time
Fourier transform. The GLA is a version of the double-projection
algorithm originally suggested by Gerchberg and Saxton [13] for
solving the phase recovery problem in terms of the Fourier
transform. The Gerchberg-Saxton algorithm works for a non-redundant
system (the Fourier transform) by considering additional side
constraints to make the solution unique. The GLA, on the other hand,
works for redundant systems without any side constraints, where
the uniqueness of the solution comes from the redundancy.
The GLA proceeds by projecting a signal iteratively onto two
different sets in $\mathbb{C}^{\frac{L}{a} \times M}$, denoted by $C_1$ and $C_2$.
$C_1$ is the set of admissible points for problem (7). It is also the
set of coefficients $c$ that can be reached from $x \in \mathbb{R}^L$ through the
frame $G$, i.e. the range of $G$:
$$C_1 = \{ c \mid \exists\, x \in \mathbb{R}^L \text{ s.t. } c = Gx \}. \tag{10}$$
This meets the hard constraint of problem (7). Note that $C_1$ is the
set that satisfies the consistency criterion [14]. By [10] we can
express the projection in the following way:
$$P_{C_1}(c) = G G^\dagger c. \tag{11}$$
Let $C_2$ be the set of coefficients minimizing the objective of (7)
without necessarily satisfying the hard constraint. It is simply given by
$$C_2 = \{ c \in \mathbb{C}^{MN} \mid |c| = s \}.$$
The projection onto $C_2$ is simply equivalent to forcing the magnitude
of $c$ to be $s$ elementwise:
$$P_{C_2}(c) = s \cdot e^{i \angle c}. \tag{12}$$
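The two projections are short enough to write out. The following sketch assumes an explicit matrix for the analysis operator (here a random stand-in frame) and uses `numpy.linalg.pinv` for $G^\dagger$; the function names are hypothetical. It works over $\mathbb{C}^L$ and does not enforce real-valued signals.

```python
import numpy as np

def project_c1(G, G_pinv, c):
    """Projection onto the range of G, equation (11)."""
    return G @ (G_pinv @ c)

def project_c2(s, c):
    """Projection onto {|c| = s}: keep the phase of c and impose the
    magnitude s elementwise, equation (12)."""
    return s * np.exp(1j * np.angle(c))

rng = np.random.default_rng(0)
L, K = 8, 32                     # illustrative sizes (assumptions)
G = rng.standard_normal((K, L)) + 1j * rng.standard_normal((K, L))
G_pinv = np.linalg.pinv(G)       # computed once and reused

x = rng.standard_normal(L)
s = np.abs(G @ x)                # target magnitude
c = rng.standard_normal(K) + 1j * rng.standard_normal(K)  # arbitrary coefficients

p1 = project_c1(G, G_pinv, c)    # consistent coefficients closest to c
p2 = project_c2(s, c)            # coefficients with magnitude s and the phase of c
```

Coefficients that already are a transform, such as `G @ x`, are left unchanged by `project_c1`, and applying it twice gives the same result as applying it once.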
The GLA can now be formulated (cf. Algorithm 1).
Algorithm 1 Griffin-Lim algorithm (GLA)
  Fix the initial phase $\angle c_0$
  Initialize $c_0 = s \cdot e^{i \angle c_0}$
  Iterate for $n = 1, 2, \ldots$
    $c_n = P_{C_1}(P_{C_2}(c_{n-1}))$
  Until convergence
  $x^* = G^\dagger c_n$
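Algorithm 1 can be sketched with explicit matrices as follows. This is a minimal illustration, not the implementation used for the experiments: the helper `gabor_matrix`, the Gaussian window, the sizes and the fixed iteration count (in place of a convergence test) are all assumptions, and the projection is carried out over $\mathbb{C}^L$.

```python
import numpy as np

def gabor_matrix(g, a, M):
    """Explicit analysis matrix: row m + n*M is the conjugate of
    g[l - n*a] * exp(2j*pi*m*l/M) (hypothetical helper, small L only)."""
    L = len(g)
    N = L // a                                  # assumes a divides L
    l = np.arange(L)
    G = np.empty((M * N, L), dtype=complex)
    for n in range(N):
        shifted = np.roll(g, n * a)
        for m in range(M):
            G[m + n * M] = np.conj(shifted * np.exp(2j * np.pi * m * l / M))
    return G

def griffin_lim(G, s, n_iter=200, phase0=None):
    """Algorithm 1 with a fixed number of iterations. Returns the signal
    estimate and the history of || |c_n| - s || / ||s||."""
    G_pinv = np.linalg.pinv(G)
    if phase0 is None:
        phase0 = np.zeros_like(s)               # zero initial phase
    c = s * np.exp(1j * phase0)
    errors = []
    for _ in range(n_iter):
        c = s * np.exp(1j * np.angle(c))        # P_C2: impose the magnitude
        c = G @ (G_pinv @ c)                    # P_C1: back to the range of G
        errors.append(np.linalg.norm(np.abs(c) - s) / np.linalg.norm(s))
    return G_pinv @ c, np.array(errors)

L, a, M = 24, 2, 8                              # illustrative sizes (assumptions)
d = np.minimum(np.arange(L), L - np.arange(L))
g = np.exp(-np.pi * (d / 4.0) ** 2)             # periodic Gaussian window
G = gabor_matrix(g, a, M)

rng = np.random.default_rng(0)
x = rng.standard_normal(L)
s = np.abs(G @ x)                               # valid STFT magnitude
x_hat, errors = griffin_lim(G, s, n_iter=200)
```

Because each step is a projection onto the nearest point of the respective set, the recorded error cannot increase from one iteration to the next, although it may stall at a local minimum.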
Improvements of the GLA can be found in the literature. In [5],
an approximate way to perform the projection PC1is proposed. As
the projection operator is highly structured, it is normally applied
using a fast algorithm, and this structure cannot be exploited
in the approximation. We have therefore chosen not to use this
approximation in this paper.
In [15], [8] the Real-Time Iterative Spectrogram Inversion (RTISI)
algorithm, which is an extension of the GLA, was proposed.
Reconstruction is performed piece by piece, again using the GLA and a
clever starting point. A further improvement is presented in [9].
In the next section, we propose a different modification for the
GLA. It should be possible to combine both modifications into
one algorithm, however the detailed analysis of this is beyond the
scope of this contribution.
V. THE FAST GRIFFIN-LIM ALGORITHM (FGLA)
Equations (6) and (7) define the problem in an optimization
form. However, classic optimization algorithms cannot easily reach
a solution since neither (6) nor (7) is convex. Phase recovery
was recently expressed as a convex optimization problem in [16],
[17]. However, at present, the heavy computational cost of this
method makes it unsuitable for long signals (i.e. $L > 128$). In this
contribution, we rather propose to search for the solution of the
non-convex problem (7). In fact, we need to find the intersection
of the two sets $C_1$ and $C_2$. Iterative projections would converge
to an optimal solution if both sets were convex. Our idea is
to make larger steps at each iteration. To do so, we will use the
information available in the previous iterations.
More precisely, we will replace the update rule of the Griffin-Lim
algorithm
$$c_n = P_{C_1}(P_{C_2}(c_{n-1})) \tag{13}$$
by
$$c_n = P_{C_1}(P_{C_2}(c_{n-1} + \alpha_n \Delta c_{n-1})) \tag{14}$$
where $\Delta c_n = c_n - c_{n-1}$. At convergence, (14) and (13) are
equivalent. However, in our experiments (14) converges to the solution
faster. Indeed, the term $\alpha_n \Delta c_{n-1}$ increases the step size
depending on the values of the current iterates.
A similar trick is used in the algorithm called "FISTA" (fast
iterative shrinkage-thresholding algorithm) [18], which speeds up
the algorithm "ISTA". That method provides the optimal
sequence of $\alpha_n$ that optimizes the convergence. In our case, the
computation of such a sequence remains an open question, due
to the non-convexity of our problem. Thus, in the following, we
consider the simple case of $\alpha$ being a constant.
Using (14), we define Algorithm 2, called the Fast Griffin-Lim
algorithm (FGLA). We observe that the heavy part of the
computation takes place in the projection $P_{C_1}$, which happens
only once per iteration in both algorithms. Hence we assume the
computational cost per iteration to be equivalent in both algorithms.
Since the projection only involves pure Gabor analysis and
synthesis, efficient algorithms [19] for these operators can be used.
Note that changing the update rule suppresses all theoretical
guarantees of convergence. This is another open issue of this
contribution.
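Algorithm 2 Fast Griffin-Lim algorithm (FGLA)
  Fix the initial phase $\angle c_0$
  Initialize $c_0 = s \cdot e^{i \angle c_0}$, $t_0 = P_{C_2}(P_{C_1}(c_0))$
  Iterate for $n = 1, 2, \ldots$
    $t_n = P_{C_1}(P_{C_2}(c_{n-1}))$
    $c_n = t_n + \alpha_n (t_n - t_{n-1})$
    Update $\alpha_n$
  Until convergence
  $x^* = G^\dagger c_n$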
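Algorithm 2 with a constant $\alpha$ can be sketched along the same lines. As before, the explicit matrix, the helper `gabor_matrix`, the window, the sizes and the fixed iteration count are assumptions made for illustration; the initialization of $t_0$ follows the algorithm as stated.

```python
import numpy as np

def gabor_matrix(g, a, M):
    """Explicit analysis matrix: row m + n*M is the conjugate of
    g[l - n*a] * exp(2j*pi*m*l/M) (hypothetical helper, small L only)."""
    L = len(g)
    N = L // a                                  # assumes a divides L
    l = np.arange(L)
    G = np.empty((M * N, L), dtype=complex)
    for n in range(N):
        shifted = np.roll(g, n * a)
        for m in range(M):
            G[m + n * M] = np.conj(shifted * np.exp(2j * np.pi * m * l / M))
    return G

def fast_griffin_lim(G, s, alpha=0.99, n_iter=200, phase0=None):
    """Algorithm 2 with constant alpha and a fixed number of iterations.
    Returns the signal estimate and the history of || |c_n| - s || / ||s||."""
    G_pinv = np.linalg.pinv(G)

    def p_c1(c):                                # projection onto the range of G
        return G @ (G_pinv @ c)

    def p_c2(c):                                # impose the magnitude s
        return s * np.exp(1j * np.angle(c))

    if phase0 is None:
        phase0 = np.zeros_like(s)               # zero initial phase
    c = s * np.exp(1j * phase0)
    t_prev = p_c2(p_c1(c))                      # t_0 as written in Algorithm 2
    errors = []
    for _ in range(n_iter):
        t = p_c1(p_c2(c))                       # plain Griffin-Lim step
        c = t + alpha * (t - t_prev)            # extrapolate along the last change
        t_prev = t
        errors.append(np.linalg.norm(np.abs(c) - s) / np.linalg.norm(s))
    return G_pinv @ c, np.array(errors)

L, a, M = 24, 2, 8                              # illustrative sizes (assumptions)
d = np.minimum(np.arange(L), L - np.arange(L))
g = np.exp(-np.pi * (d / 4.0) ** 2)             # periodic Gaussian window
G = gabor_matrix(g, a, M)

rng = np.random.default_rng(0)
x = rng.standard_normal(L)
s = np.abs(G @ x)                               # valid STFT magnitude
x_fast, errors_fast = fast_griffin_lim(G, s, alpha=0.99, n_iter=200)
x_plain, errors_plain = fast_griffin_lim(G, s, alpha=0.0, n_iter=200)
```

With `alpha=0.0` the extrapolation vanishes and the loop reduces to the plain Griffin-Lim iteration, which makes it easy to compare both error histories on the same problem. Unlike the GLA, the error of the accelerated iteration is not guaranteed to decrease at every step.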
VI. NUMERICAL RESULTS
In this section we present three different experiments of phase
reconstruction. We remind the reader that, as the problem is
not convex, the algorithms will most likely converge to a local
minimum depending on the starting point.
Experiments were done using classical windows. Those presented
in this paper use a Nuttall (figure 1), a Gaussian (figure
2) and a Hann window (figure 3). We choose as frame parameters
$a = 32$, $M = 256$. This gives a redundancy of 8, which ensures
that all the information lies in the spectrogram [3].
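The redundancy follows directly from the frame parameters. Assuming that $a$ divides $L$, so that there are $N = L/a$ time positions, the transform produces $MN$ coefficients for a signal of length $L$:

```latex
\[
  \text{redundancy} \;=\; \frac{MN}{L} \;=\; \frac{M}{a} \;=\; \frac{256}{32} \;=\; 8 .
\]
```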
Using different parameters or windows leads to similar results.
A reproducible research addendum can be downloaded at
http://unlocbox.sourceforge.net/rr/fgla/. From this archive, parameters
can easily be changed and other configurations tested.
In the first example, we aim at finding a signal from its
spectrogram (phase reconstruction). In this specific case we do
know that such a signal exists. The initial phase is simply set
to zero and the number of iterations for both algorithms is fixed
to 100000. In figure 1, we observe that the FGLA not only
converges faster (better average slopes), but also to points with
smaller error. Note that for the signal 'bat', the new algorithm
was able to perform perfect reconstruction. This signal is very
short, only 400 samples. We also observe that, using the FGLA,
the SSNR is not strictly increasing from one iteration to another.
However, on average, the SSNR is increasing.
In the second example, we start from a signal, compute the
Gabor coefficients, apply a spectrogram multiplier and reconstruct
a new signal as well as we can. In that case, signals fitting exactly
the modified STFTM usually do not exist. As a consequence, we
are looking for the signal with the best spectrogram approximation.
The applied multiplier is random. This multiplier is chosen because
it modifies the spectrogram in a significant way and, in that
case, the algorithms usually need more iterations to converge. Other
multipliers give similar results. The initial phase, this time, is not
set to zero as in the previous experiment; instead we keep the original
phase of the STFT. We fixed the maximum number of iterations
to 10000 as well. Generally, the new algorithm converges faster.
The SSNR is sometimes improved, but not in a very significant
manner.
In the third and last experiment, we analyze the effect of $\alpha$
on the FGLA. Figure 3 displays tests for various constants $\alpha$.
$\alpha = 1$ seems to be the limit of stability of the algorithm. $\alpha = 0$
corresponds to the Griffin-Lim algorithm. Increasing $\alpha$ leads to
better results, with some optimal value near 1 but not bigger. As a
consequence, 0.99 has been chosen for the other experiments.
Figure 1. Phase recovery problem: SSNR through iterations for the GLA
and the FGLA.
Figure 2. STFT magnitude optimization problem: SSNR through iterations
for the GLA and the FGLA.
The algorithm presented in this paper has been incorporated as
an option for the frsynabs function in the LTFAT toolbox [20].
Figure 3. Influence of the parameter $\alpha$ on the FGLA.
VII. CONCLUSION
In this paper, we have presented the phase recovery problem
in the form of an optimization problem. This approach allows us
to give a new interpretation of the GLA in order to
speed it up. We proposed an algorithm (FGLA) that is indeed
faster and also seems to converge to better points. However, any
theoretical guarantee of convergence has been lost in the process.
Practically, our algorithm can replace the GLA at a very low cost
of implementation and computation. In further research, we
will look for a convergence proof and an optimal sequence of $\alpha_n$,
and possibly merge our algorithm with the RTISI real-time GLA
algorithm.
Acknowledgment
This work was supported by the Austrian Science Fund (FWF)
START-project FLAME (“Frames and Linear Operators for Acous-
tical Modeling and Parameter Estimation”; Y 551-N13).
VIII. REFERENCES
[1] H. G. Feichtinger and T. Strohmer, Eds., Gabor Analysis and
Algorithms, Boston, 1998.
[2] P. Majdak, P. Balazs, W. Kreuzer, and M. Dörfler, “A time-
frequency method for increasing the signal-to-noise ratio
in system identification with exponential sweeps,” in Pro-
ceedings of the 36th International Conference on Acoustics,
Speech and Signal Processing, ICASSP 2011, Prague, 2011.
[3] R. Balan, P. Casazza, and D. Edidin, “On signal reconstruc-
tion without phase,” Applied and Computational Harmonic
Analysis, vol. 20, no. 3, pp. 345–356, 2006.
[4] B. G. Bodmann and N. Hammen, “Stable phase retrieval with
low-redundancy frames,” arXiv preprint arXiv:1302.5487,
2013.
[5] J. Le Roux, H. Kameoka, N. Ono, and S. Sagayama, “Fast
signal reconstruction from magnitude STFT spectrogram based
on spectrogram consistency,” in Proc. 13th International
Conference on Digital Audio Effects (DAFx-10), 2010, pp.
397–403.
[6] J. Le Roux and E. Vincent, “Consistent Wiener filtering for
audio source separation,” Signal Processing Letters, IEEE,
vol. 20, no. 3, pp. 217–220, 2013.
[7] D. Griffin and J. Lim, “Signal estimation from modified
short-time Fourier transform,” Acoustics, Speech and Signal
Processing, IEEE Transactions on, vol. 32, no. 2, pp. 236–
243, 1984.
[8] X. Zhu, G. Beauregard, and L. Wyse, “Real-time signal esti-
mation from modified short-time Fourier transform magnitude
spectra,” Audio, Speech, and Language Processing, IEEE
Transactions on, vol. 15, no. 5, pp. 1645–1653, 2007.
[9] V. Gnann and M. Spiertz, “Improving RTISI Phase Estima-
tion With Energy Order and Phase Unwrapping,” in Proc.
of International Conference on Digital Audio Effects DAFx,
vol. 10, 2010.
[10] O. Christensen, An Introduction to Frames and Riesz Bases.
Birkhäuser, 2003.
[11] P. Balazs, “Basic definition and properties of Bessel multi-
pliers,” Journal of Mathematical Analysis and Applications,
vol. 325, no. 1, pp. 571–585, January 2007. [Online].
Available: http://dx.doi.org/10.1016/j.jmaa.2006.02.012
[12] P. Balazs, B. Laback, G. Eckel, and W. A. Deutsch, “Time-
frequency sparsity by removing perceptually irrelevant
components using a simple model of simultaneous masking,”
IEEE Transactions on Audio, Speech and Language
Processing, vol. 18, no. 1, pp. 34–49, 2010. [Online].
Available: http://www.kfs.oeaw.ac.at/xxl/mask/mask.pdf
[13] R. W. Gerchberg and W. O. Saxton, “A practical algorithm
for the determination of the phase from image and diffraction
plane pictures,” Optik, vol. 35, no. 2, pp. 237–250, 1972.
[14] J. Le Roux, N. Ono, and S. Sagayama, “Explicit consistency
constraints for STFT spectrograms and their application to
phase reconstruction,” Proc. SAPA, pp. 23–28, 2008.
[15] G. T. Beauregard, X. Zhu, and L. Wyse, “An efficient algo-
rithm for real-time spectrogram inversion,” in Proceedings
of the 8th International Conference on Digital Audio Effects,
2005, pp. 116–118.
[16] E. J. Candes, T. Strohmer, and V. Voroninski, “Phaselift:
Exact and stable signal recovery from magnitude measure-
ments via convex programming,” Communications on Pure
and Applied Mathematics, 2012.
[17] D. L. Sun and J. O. Smith III, “Estimating a signal from
a magnitude spectrogram via convex optimization,” arXiv
preprint arXiv:1209.2076, 2012.
[18] A. Beck and M. Teboulle, “A fast iterative shrinkage-
thresholding algorithm for linear inverse problems,” SIAM
Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202,
2009.
[19] P. L. Søndergaard, “Efficient Algorithms for the Discrete
Gabor Transform with a long FIR window,” J. Fourier Anal.
Appl., vol. 18, no. 3, pp. 456–470, 2012.
[20] P. L. Søndergaard, B. Torrésani, and P. Balazs, “The Linear
Time Frequency Analysis Toolbox,” International Journal of
Wavelets, Multiresolution Analysis and Information Process-
ing, vol. 10, no. 4, 2012.
... For the audio model, we use Auffusion 4 [106], which finetunes Stable Diffusion v1.5 on log-mel spectrograms. To synthesize audio from the log-mel spectrograms, we consider two options: following [106] and using off-the-shelf HiFi-GAN [59] vocoder, or the Griffin-Lim algorithm [43,82]. We use HiFi-GAN for our main experiments. ...
... This suggests that our samples are not simply adversarial examples, but rather truly spectrograms that look like images. We also experiment with using Griffin-Lim [82] as a vocoder, with similar results to HiFi-GAN as shown in Fig. 6. We opt to use HiFi-GAN as our default vocoder as it outperforms Griffin-Lim in audio quality, with Griffin-Lim attaining a CLAP score of 0.302, compared to 0.335 obtained from HiFi-GAN. ...
Preprint
Spectrograms are 2D representations of sound that look very different from the images found in our visual world. And natural images, when played as spectrograms, make unnatural sounds. In this paper, we show that it is possible to synthesize spectrograms that simultaneously look like natural images and sound like natural audio. We call these spectrograms images that sound. Our approach is simple and zero-shot, and it leverages pre-trained text-to-image and text-to-spectrogram diffusion models that operate in a shared latent space. During the reverse process, we denoise noisy latents with both the audio and image diffusion models in parallel, resulting in a sample that is likely under both models. Through quantitative evaluations and perceptual studies, we find that our method successfully generates spectrograms that align with a desired audio prompt while also taking the visual appearance of a desired image prompt. Please see our project page for video results: https://ificl.github.io/images-that-sound/
... Finally, we use Griffin-Lim [23,38] algorithm to obtain RIR waveforms from magnitude STFTs. ...
Preprint
Full-text available
Sound plays a major role in human perception, providing essential scene information alongside vision for understanding our environment. Despite progress in neural implicit representations, learning acoustics that match a visual scene is still challenging. We propose NeRAF, a method that jointly learns acoustic and radiance fields. NeRAF is designed as a Nerfstudio module for convenient access to realistic audio-visual generation. It synthesizes both novel views and spatialized audio at new positions, leveraging radiance field capabilities to condition the acoustic field with 3D scene information. At inference, each modality can be rendered independently and at spatially separated positions, providing greater versatility. We demonstrate the advantages of our method on the SoundSpaces dataset. NeRAF achieves substantial performance improvements over previous works while being more data-efficient. Furthermore, NeRAF enhances novel view synthesis of complex scenes trained with sparse data through cross-modal learning.
... This approach is able to compute the speech directly from graphemes or phonemes. Followed by the Griffin-Lim algorithm [11] or WaveNet [12] vocoder, Tacotron2 [1] is able to yield synthetic speech that approaches the real human speech. However, it is hard to control the speaking style when using the end-to-end speech synthesis architectures. ...
Article
Full-text available
Data-driven deep learning application in earthquake engineering highlights the insufficient quantity and the imbalanced feature distribution of measured ground motions, which can be mitigated with artificial ones. Traditional ground motion generation techniques tend to extend the catalogs conditioning on existing records, while current deep learning-based methods such as generative adversarial networks (GANs) only provide limited duration or sampling rate, thus obstructing further applications. In this paper, an invertible time–frequency transformation process is employed, based on which the transformed earthquake representation is implicitly modeled by advanced GANs for high-resolution and unconditional generation. Moreover, leveraging the disentangling property of the GAN’s latent space, the newly developed latent space walking method is adopted to assure the generations with controllable time–frequency features. A feature-balanced generated ground motion dataset has been constructed in combination with the proposed methods, and the application potential was demonstrated through comparative experiments of different datasets.
Article
This paper presents a novel neural speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is a core module for direct wrapped phase prediction. This architecture consists of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating the phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses defined between the predicted wrapped phase spectra and natural ones by activating the instantaneous phase error, group delay error and instantaneous angular frequency error using an anti-wrapping function. We mathematically demonstrate that the anti-wrapping function should possess three properties, namely parity, periodicity and monotonicity. We also achieve low-latency streamable phase prediction by combining causal convolutions and knowledge distillation training strategies. For both analysis-synthesis and specific speech generation tasks, experimental results show that our proposed neural speech phase prediction model outperforms the iterative phase estimation algorithms and neural network-based phase prediction methods in terms of phase prediction precision, efficiency and robustness. Compared with HiFi-GAN-based waveform reconstruction method, our proposed model also shows outstanding efficiency advantages while ensuring the quality of synthesized speech.
Article
Full-text available
This paper presents two ways to improve the Real-Time Iterative Spectrogram Inversion (RTISI) algorithm. The standard RTISI phase estimator with look-ahead processes the buffered frames in reverse order. We show that better results are achieved by control-ling this order according to frame energy. Another improvement is to initialize the last row of the phase estimator buffer by progress-ing the unwrapped phase difference of the previous frames. Fur-thermore, we extend these improvements to dual window length phase estimation and analyze the performance in SER with respect to different analysis window lengths.
Conference Paper
Full-text available
We present a computationally efficient real-time algorithm for constructing audio signals from spectrograms. Spectrograms consist of a time sequence of short time Fourier transform magni- tude (STFTM) spectra. During the audio signal construction proc- ess, phases are derived for the individual frequency components so that the spectrogram of the constructed signal is as close as possi- ble to the target spectrogram given real-time constraints. The algorithm is a variation of the classic Griffin and Lim [1] tech- nique modified to be computable in real-time. We discuss the application of the algorithm to time-scale modification of audio signals such as speech and music, and performance is compared with [three] other methods. The new algorithm generates compa- rable or better results with significantly less computation. The phase consistency between adjacent frames produces excellent subjective sound quality with minimal fame transition artifacts.
Article
Full-text available
Wiener filtering is one of the most ubiquitous tools in signal processing, in particular for signal denoising and source separation. In the context of audio, it is typically applied in the time-frequency domain by means of the short-time Fourier transform (STFT). Such processing does generally not take into account the relationship between STFT coefficients in different time-frequency bins due to the redundancy of the STFT, which we refer to as consistency. We propose to enforce this relationship in the design of the Wiener filter, either as a hard constraint or as a soft penalty. We derive two conjugate gradient algorithms for the computation of the filter coefficients and show improved audio source separation performance compared to the classical Wiener filter both in oracle and in blind conditions.
Article
Full-text available
The Discrete Gabor Transform (DGT) is the most commonly used signal transform for signal analysis and synthesis using a linear frequency scale. The development of the Linear Time-Frequency Analysis Toolbox (LTFAT) has been based on a detailed study of many variants of the relevant algorithms. As a side result of these systematic developments of the subject, two new methods are presented here. Comparisons are made with respect to the computational complexity, and the running time of optimised implementations in the C programming language. The new algorithms have the lowest known computational complexity and running time when a long FIR window is used. The implementations are freely available for download. By summarizing general background information on the state of the art, this article can also be seen as a research survey, sharing with the readers experience in the numerical work in Gabor analysis.
Article
Full-text available
The Linear Time Frequency Analysis Toolbox is a MATLAB/Octave toolbox for computational time-frequency analysis. It is intended both as an educational and computational tool. The toolbox provides the basic Gabor, Wilson and MDCT transform along with routines for constructing windows (filter prototypes) and routines for manipulating coefficients. It also provides a bunch of demo scripts devoted either to demonstrating the main functions of the toolbox, or to exemplify their use in specific signal processing applications. In this paper we describe the used algorithms, their mathematical background as well as some signal processing applications.
Article
Full-text available
As many acoustic signal processing methods, for example for source separation or noise canceling, operate in the magnitude spectrogram domain, the problem of reconstructing a percep-tually good sounding signal from a modified magnitude spec-trogram, and more generally to understand what makes a spec-trogram consistent, is very important. In this article, we derive the constraints which a set of complex numbers must verify to be a consistent STFT spectrogram, i.e. to be the STFT spectro-gram of a real signal, and describe how they lead to an objective function measuring the consistency of a set of complex num-bers as a spectrogram. We then present a flexible phase recon-struction algorithm based on a local approximation of the con-sistency constraints, explain its relation with phase-coherence conditions devised as necessary for a good perceptual sound quality, and derive a real-time time scale modification algorithm based on sliding-block analysis. Finally, we show how incon-sistency can be used to develop a spectrogram-based audio en-cryption scheme.
Article
Full-text available
The modification of magnitude spectrograms is at the core of many audio signal processing methods, from source separation to sound modification or noise canceling, and reconstructing a natu-ral sounding signal in such situations is thus a very important issue. This article presents recent theoretical and experimental develop-ments on the application to signal reconstruction from a modified magnitude spectrogram of the constraints that an array of complex numbers must verify to be a consistent short-time Fourier trans-form (STFT) spectrogram, i.e., to be the STFT spectrogram of an actual real-valued signal. We give here further theoretical insights, present several potential variations on our previously introduced algorithm, investigate various techniques to speed up the signal reconstruction process, and present a thorough experimental com-parison of the performance of all the considered algorithms.
Article
The problem of recovering a signal from the magnitude of its short-time Fourier transform (STFT) is a longstanding one in audio signal processing. Existing approaches rely on heuristics that often perform poorly because of the nonconvexity of the problem. We introduce a formulation of the problem that lends itself to a tractable convex program. We observe that our method yields better reconstructions than the standard Griffin-Lim algorithm. We provide an algorithm and discuss practical implementation details, including how the method can be scaled up to larger examples.
Article
We investigate the recovery of vectors from magnitudes of frame coefficients when the frames have a low redundancy, meaning a small number of frame vectors compared to the dimension of the Hilbert space. We first show that for vectors in d dimensions, 4d-4 suitably chosen frame vectors are sufficient to uniquely determine each signal, up to an overall unimodular constant, from the magnitudes of its frame coefficients. Then we discuss the effect of noise and show that 8d-4 frame vectors provide a stable recovery if part of the frame coefficients is bounded away from zero. In this regime, perturbing the magnitudes of the frame coefficients by noise that is sufficiently small results in a recovery error that is at most proportional to the noise level.
Conference Paper
In this paper, we present an algorithm to estimate a signal from its modified short-time Fourier transform (STFT). This algorithm is computationally simple and is obtained by minimizing the mean squared error between the STFT of the estimated signal and the modified STFT. Using this algorithm, we also develop an iterative algorithm to estimate a signal from its modified STFT magnitude. The iterative algorithm is shown to decrease, in each iteration, the mean squared error between the STFT magnitude of the estimated signal and the modified STFT magnitude. The major computation involved in the iterative algorithm is the discrete Fourier transform (DFT) computation, and the algorithm appears to be real-time implementable with current hardware technology. The algorithm developed in this paper has been applied to the time-scale modification of speech. The resulting system generates very high-quality speech, and appears to be better in performance than any existing method.