Figure 1 - uploaded by Jody Kreiman
The inverse filtering process. From A. Ní Chasaide & C. Gobl, "Voice source variation," in W.J. Hardcastle & J. Laver, The Handbook of Phonetic Sciences (Oxford, Blackwell, 1997), p. 430.


Contexts in source publication

Context 1
... of the shape of the harmonic part of the glottal source can be obtained by inverse filtering the voice signal (Figure 1). In source-filter theory, the vocal tract is modeled as an all-pole filter shaping the input glottal source, and radiation at the lips (which increases the output sound energy level by 6 dB/octave) is modeled by a differentiator. ...
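The signal flow described above can be sketched in a few lines. This is not the program's implementation, only an illustration of the operations the excerpt names: the estimated all-pole polynomial A(z) is applied as an all-zero inverse filter to cancel the vocal tract resonances, and a leaky integrator cancels the lip-radiation differentiator (the 0.99 leak coefficient is an assumption chosen for numerical stability, not a documented setting).

```python
import numpy as np
from scipy.signal import lfilter

def inverse_filter(speech, lpc_coeffs, leak=0.99):
    """Estimate the glottal flow derivative and glottal flow from a
    microphone signal, given all-pole vocal tract coefficients
    lpc_coeffs = [1, a1, ..., ap].

    Applying the inverse (all-zero) filter A(z) undoes the vocal tract
    resonances; the leaky integrator then cancels the lip-radiation
    differentiator to recover flow from the flow derivative.
    """
    speech = np.asarray(speech, dtype=float)
    flow_derivative = lfilter(lpc_coeffs, [1.0], speech)
    flow = lfilter([1.0], [1.0, -leak], flow_derivative)
    return flow, flow_derivative
```

Given a perfect vocal tract estimate, the flow derivative returned by this sketch is exactly the source signal that excited the all-pole filter; in practice the quality of the result depends entirely on the accuracy of the resonance estimates.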
Context 2
... spectrum will now appear in the lower left part of the analysis window. Next, click the LPC button (Figure 10). Select autocorrelation and preemphasis, as shown in Figure 10. ...
Context 3
... click the LPC button (Figure 10). Select autocorrelation and preemphasis, as shown in Figure 10. Window size considerations are as above. ...
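The autocorrelation analysis with preemphasis selected here can be sketched as follows. This is a generic textbook implementation, not the program's code; the Hamming window and the 0.94 preemphasis coefficient are illustrative assumptions.

```python
import numpy as np

def lpc_autocorrelation(frame, order, preemphasis=0.94):
    """All-pole vocal tract estimate by the autocorrelation method:
    preemphasize, window, autocorrelate, then solve the normal
    equations with the Levinson-Durbin recursion.

    Returns A(z) coefficients [1, a1, ..., a_order].
    """
    frame = np.asarray(frame, dtype=float)
    # Preemphasis flattens the source's spectral tilt before the
    # resonances are estimated.
    x = np.append(frame[0], frame[1:] - preemphasis * frame[:-1])
    x = x * np.hamming(len(x))
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    # Levinson-Durbin recursion.
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a
```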
Context 4
... OK is clicked, the number of cycles in the upper right window will decrease, an LPC envelope will appear over the FFT spectrum, and numbers will appear in the table of formants and bandwidths in the upper left part of the window (Figure 11). An error signal will also appear under the waveform in the center of the window. ...
Context 5
... the left mouse button near the left peak in the error signal, and click the right button near the right peak. Precision is not critical, and the choice of peaks is not necessarily straightforward, as Figure 11 shows. You may have several choices of peak, or you may have to guess at the cycle boundaries. ...
Context 6
... run the inverse filter using the autocorrelation estimates of vocal tract resonances, just click the IF button on the toolbar (Figure 12). If the extension for the file in use is .AUD, the program assumes that this is a microphone signal and automatically cancels the radiation characteristic. ...
Context 7
... right panels of Figure 12 show the output of the inverse filter. The top tracing is the glottal waveform, the second trace is the flow derivative, and the bottom shows the spectrum of the flow derivative. ...
Context 8
... remove this unwanted resonance (or any other resonance), point the cursor at its location in the spectrum shown in the lower left panel of the display, and double right click. The formant will be deleted and the inverse filter automatically reapplied with the new vocal tract model, as shown in Figure 13. (Resonances can also be deleted manually by editing the values in the table, and then clicking the IF button to apply the new vocal tract model.) ...
Context 9
... add a new resonance, point the cursor to the appropriate place in the spectrum in the lower left panel of the display and double left click. A resonance will appear in that location (at 1872 Hz, indicated by an arrow in the figure) and in the table at the top of the display, with default bandwidth of 100 Hz (Figure 14). Notice the change in the shape of the flow derivative spectrum (lower right panel), also indicated by an arrow. ...
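The formant table drives the vocal tract model: each row contributes one conjugate pole pair whose angle is set by the formant frequency and whose radius by the bandwidth. A minimal sketch of that mapping using the standard formant-synthesis formulas (not the program's code; the 10-kHz sampling rate is assumed for the example):

```python
import numpy as np

def formants_to_lpc(formants, bandwidths, fs):
    """Turn a table of formant frequencies and bandwidths (Hz) into the
    all-pole polynomial A(z): one conjugate pole pair per formant."""
    a = np.array([1.0])
    for f, bw in zip(formants, bandwidths):
        r = np.exp(-np.pi * bw / fs)      # radius: narrower bandwidth -> pole nearer unit circle
        theta = 2.0 * np.pi * f / fs      # angle: formant frequency
        pair = np.array([1.0, -2.0 * r * np.cos(theta), r * r])
        a = np.convolve(a, pair)          # multiply the pole pair into A(z)
    return a
```

Under this view, adding a resonance (the double left click above) amounts to convolving another pole pair into A(z), and deleting one amounts to dividing that factor back out (e.g. with np.polydiv) before reapplying the inverse filter.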
Context 10
... remove this resonance, double right click it, as described above. Figure 15 shows the effect of deleting a resonance from the analysis. In this case, the formant at 4636 Hz has been deleted (by double right clicking), resulting in a large increase in flow derivative ripple and an extra bump in the flow derivative spectrum, both indicated by arrows in the figure. ...
Context 11
... to a new position. Figure 16 shows the result of dragging F1 from its starting value of 829 Hz to a value of 711 Hz; notice the increase in ripple in the flow derivative. As the formant moves, the inverse filter and display update automatically, showing the effect of the new resonance value on the estimated glottal waveform, flow derivative, and flow derivative spectrum. ...
Context 12
... may also be manipulated interactively using the sliders to the right of the table of resonance values. Dragging a slider to the right widens the bandwidth of the resonance in question; in Figure 17, the bandwidth of the first formant has been widened to excess. Dragging the slider to the left narrows the bandwidth. ...
Context 13
... estimate the vocal tract using covariance analysis, begin by windowing the signal, calculating an FFT, and using autocorrelation LPC analysis to select a cycle and calculate F0, as described above. Then click the LPC button on the toolbar again, and this time select covariance (Figure 18). The default window size of 56 points is usually too long. ...
Context 14
... default window size of 56 points is usually too long. Depending on F0, adjust this value so that ...

Figure 16. Result of decreasing the frequency of F1 by dragging the resonance peak to a lower value.
Context 15
... you use fewer than 29 points, change the order to 12. When you click OK, a bar will appear above the time series waveform showing the position and size of the window applied in estimating the vocal tract (as indicated by the arrow in Figure 19). When you are satisfied with the window size, click the IF button to proceed with the analysis, as above. ...
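For a short, pitch-synchronous window the covariance method has a compact least-squares formulation. The sketch below follows the usual textbook definition rather than the program's implementation; the length check mirrors the rule of thumb above that very short windows require a lower order.

```python
import numpy as np

def lpc_covariance(frame, order):
    """All-pole estimate by the covariance method: least-squares linear
    prediction over the (short, unwindowed) analysis frame, suited to
    pitch-synchronous analysis. Returns [1, a1, ..., a_order]."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    if n <= 2 * order:
        raise ValueError("frame too short for this LPC order")
    # Row t predicts frame[t] from the `order` preceding samples.
    X = np.column_stack(
        [frame[order - k - 1 : n - k - 1] for k in range(order)]
    )
    y = frame[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.concatenate(([1.0], -coeffs))
```

Because no taper is applied, the covariance method can model a signal exactly over the analysis window, which is why it behaves well on single cycles where the autocorrelation method would need a much longer frame.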
Context 16
... you are satisfied with the window size, click the IF button to proceed with the analysis, as above. The output of the inverse filter based on the covariance LPC analysis is shown in Figure 19. The result is very similar to that obtained using autocorrelation LPC, except for a spurious formant at 5 kHz which produces a very steep drop-off in the flow derivative spectrum at high frequencies. ...
Context 17
... Save-Concatenated Cycles (Figure 21). This creates a file, filenamec.aud ...
Context 18
... different algorithms are available for synthesizing frequency modulation (the acoustic correlate of tremor). The first sinusoidally modulates the vocal F0 above and below the mean F0 specified in the synthesizer (Figure 31a). This algorithm provided a good perceptual match to about one third of the voices we have studied (Kreiman et al., 2003). ...
Context 19
... the frequency modulation for the other two thirds of voices studied was non-sinusoidal and irregular in rate. For this reason, a "random tremor" model has also been implemented (Figure 31b). In this model, the pattern of variation in F0 is generated by passing white noise through an FIR Kaiser window low-pass filter with cutoff frequency equal to the maximum modulation rate. ...
Context 20
... you want to use the synthetic tremor models, select the tremor model you want to use. As described above, the sine wave model modulates F0 in a sinusoidal pattern; the random model creates an irregular pattern of frequency modulation (see Figure 31). Note that amplitude modulation and F0 modulation may be selected independently of one another. ...
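Both tremor models described above are straightforward to sketch. The two functions below follow the descriptions in the text (a sinusoid around the mean F0, and white noise through an FIR Kaiser-window low-pass filter with cutoff at the maximum modulation rate); the 200-Hz contour sampling rate, 101-tap filter length, Kaiser beta of 8.0, and peak normalization are illustrative assumptions, not the synthesizer's settings.

```python
import numpy as np
from scipy.signal import firwin, lfilter

def sine_tremor(mean_f0, depth_hz, rate_hz, dur_s, fs=200):
    """Sinusoidal model: F0 swings symmetrically above and below the mean."""
    t = np.arange(int(dur_s * fs)) / fs
    return mean_f0 + depth_hz * np.sin(2.0 * np.pi * rate_hz * t)

def random_tremor(mean_f0, depth_hz, max_rate_hz, dur_s, fs=200, seed=0):
    """Random model: white noise low-pass filtered by an FIR filter
    designed with a Kaiser window, cutoff at the maximum modulation
    rate, giving an irregular, band-limited F0 contour."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(int(dur_s * fs))
    taps = firwin(101, max_rate_hz, window=("kaiser", 8.0), fs=fs)
    slow = lfilter(taps, [1.0], noise)
    slow /= np.max(np.abs(slow))   # scale so peak deviation equals depth_hz
    return mean_f0 + depth_hz * slow
```

An analogous pair of amplitude-modulation contours could be generated the same way, consistent with the note that amplitude and F0 modulation are selected independently.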
Context 21
... model the F0 contour, click the FM button on the toolbar. A new pane opens in the window, as shown in Figure 41. The top plot in this pane shows the unsmoothed F0 track for the entire 1 sec voice sample. ...
Context 22
... a result, you can create some very unusual source pulses. Increasing the amplitude of a group of harmonics can also introduce the equivalent of a formant into the synthetic speech (Figure 51). This can be instructive and fun to play with, but be aware of what you are doing. ...

Similar publications

Article
Full-text available
This paper proposes an approach to transform speech from a neutral style into other expressive styles using both prosody and voice quality (VoQ). The main aim is to validate the usefulness of VoQ in the enhancement of expressive synthetic speech. A Harmonic plus Noise Model (HNM) is used to modify speech following a set of rules extracted from an e...
Article
Full-text available
The problem of classification efficiency analysis in speech signals for speaker identification using a machine learning approach has been a challenge for several years. The classification using the machine learning approach maps the speaker’s data of attention into several segments. For the speaker’s classification system, segments represent a uniq...
Article
Full-text available
PURPOSE: to determine the fundamental frequency (Fo) for the voice of 50 boys and 50 girls born and living in Belo Horizonte, whose ages range from 6 to 8 years. METHODS: both genders were chosen from Belo Horizonte city. The process of voice recording was done by using digital sustained vowel [ε] within proper tone and intensity, lasting three sec...
Article
Full-text available
Acoustic analysis is often favored over perceptual evaluation of voice because it is considered objective, and thus reliable. Specifically, jitter is frequently used as an index of pathologic voice quality because of its moderate to high correlation with vocal roughness. This study examined the relative reliability of human listeners and automatic...

Citations

... Since the subglottal pressure was unknown a priori, a constant 15-dB SPL correction was applied to all conditions. The oral volume flow rate was inverse filtered to obtain the glottal flow waveform using the INVF software developed at UCLA (Kreiman et al., 2016), from which the glottal flow-related measures (Qmean, Qamp, CQ, MFDR, and MADR) were extracted. Figure 1 compares the neural network-predicted subglottal pressure and the approximations from the intraoral air pressure measurement. ...
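The flow-based measures named in this citation have common textbook definitions that are simple to compute once a glottal flow cycle has been extracted. The sketch below uses those generic definitions, not the cited software's algorithms: the 10%-of-amplitude closed-phase criterion is an assumption (criteria vary across studies), and MADR is omitted.

```python
import numpy as np

def glottal_flow_measures(flow, fs, closed_frac=0.1):
    """Illustrative per-cycle measures on a glottal flow waveform:
    Qmean - mean flow over the cycle
    Qamp  - AC (peak-to-peak) flow amplitude
    MFDR  - maximum flow declination rate (most negative d(flow)/dt)
    CQ    - closed quotient: fraction of the cycle where flow sits
            within `closed_frac` of its minimum (a simple criterion)."""
    flow = np.asarray(flow, dtype=float)
    q_mean = flow.mean()
    q_amp = flow.max() - flow.min()
    derivative = np.diff(flow) * fs      # finite-difference flow derivative
    mfdr = -derivative.min()
    closed = flow < flow.min() + closed_frac * q_amp
    return {"Qmean": q_mean, "Qamp": q_amp, "MFDR": mfdr, "CQ": closed.mean()}
```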
Article
Full-text available
We previously reported a simulation-based neural network for estimating vocal fold properties and subglottal pressure from the produced voice. This study aims to validate this neural network in a single–human subject study. The results showed reasonable accuracy of the neural network in estimating the subglottal pressure in this particular human subject. The neural network was also able to qualitatively differentiate soft and loud speech conditions regarding differences in the subglottal pressure and degree of vocal fold adduction. This simulation-based neural network has potential applications in identifying unhealthy vocal behavior and monitoring progress of voice therapy or vocal training.
... All synthesis was completed by the first author. Methods are described in detail in Kreiman et al. (2016) [see also Kreiman et al. (2010)]. Briefly, voice samples were inverse filtered using the method described by Javkin et al. (1987). ...
Article
Full-text available
No agreed-upon method currently exists for objective measurement of perceived voice quality. This paper describes validation of a psychoacoustic model designed to fill this gap. This model includes parameters to characterize the harmonic and inharmonic voice sources, vocal tract transfer function, fundamental frequency, and amplitude of the voice, which together serve to completely quantify the integral sound of a target voice sample. In experiment 1, 200 voices with and without diagnosed vocal pathology were fit with the model using analysis-by-synthesis. The resulting synthetic voice samples were not distinguishable from the original voice tokens, suggesting that the model has all the parameters it needs to fully quantify voice quality. In experiment 2 parameters that model the harmonic voice source were removed one by one, and the voice tokens were re-synthesized with the reduced model. In every case the lower-dimensional models provided worse perceptual matches to the quality of the natural tokens than did the original set, indicating that the psychoacoustic model cannot be reduced in dimensionality without loss of fit to the data. Results confirm that this model can be validly applied to quantify voice quality in clinical and research applications .
... An optimisation procedure is then carried out to refine the parameter estimates. Similar model-fitting approaches are used in the systems described in Kreiman et al. (2006) and Airas (2008). ...
Thesis
Full-text available
Statistical parametric speech synthesis (SPSS) offers a means of generating synthetic speech without the need for complex and extensive rules, but can sometimes lack in naturalness through the use of simple excitation models and an absence of prosodic variation. Improving the prosody of synthetic speech would be desirable for applications, such as educational games or communication systems for people with disordered speech. The use of an acoustic glottal model as the basis of the synthetic source signal could offer a more adequate modelling of prosody. However, these models can entail many potentially important parameters and controlling these could be challenging. The aims of this work were to: investigate how an acoustic glottal model could be used to manipulate aspects of linguistic and paralinguistic prosody of synthetic speech using a minimal set of control parameters; incorporate the knowledge gained from this investigation into an analysis-and-synthesis system; use this system in SPSS; and conduct preliminary tests to demonstrate how the system can be used to explore the voice source correlates of prosody with user-driven manipulation tasks. To achieve the first goal, experiments were carried out to explore how the global waveshape parameter, Rd, could be used to control aspects of linguistic and paralinguistic prosody. This parameter can be used to generate Liljencrants-Fant (LF) model pulses with different shapes that correspond to voice qualities ranging from breathy to tense. As the tense-lax dimension of voice quality is important in prosodic modulation, Rd appears to be ideal for minimising the number of control parameters needed to transform voice quality. Three experiments were carried out, using manually inverse filtered data, to investigate how Rd could be used as a control parameter for linguistic and paralinguistic prosody, even in the absence of f0 modulation. 
Experiment 1 examined how manipulating Rd could be used to control where focal prominence occurs in an utterance. Experiment 2 explored how Rd could be used as a control parameter for perceived affect. Experiment 3 built on the results of Experiment 1 to optimise the implementation of the Rd parameter contour. The results of these experiments confirmed that Rd can serve as a control parameter to generate linguistic prominence as well as paralinguistic modification of affective colouring. The results confirmed, and elaborated on, the findings of earlier research, suggesting that tense-lax modulation of voice quality is important in prosodic expression. They indicated that a more tense phonation on the focally accented item can be used to signal prominence, while laxer phonation of post-focal material provides source deaccentuation that enhances the perceived prominence. These experiments provided information concerning Rd ranges and settings that fed into the development of the second goal of this work, i.e. an analysis-and-synthesis system, called GlórCáil, for the control of parameters for prosodic variation in synthesis. The system also allows for some speaker characteristic transformation, letting the user manipulate both voice source and vocal tract parameters before resynthesis. This provides the means to alter the prosodic pattern and speaker characteristics of an utterance. The interface allows the user to listen to any changes they make after resynthesis, to see if they have the desired effect or if further manipulations are required. The third goal was achieved by integrating the finished system into a DNN-based speech synthesis framework so that it could be used to generate unseen synthetic speech. The final goal of this work was achieved by demonstrating the system’s ability to control aspects of linguistic and paralinguistic prosody, as well as speaker characteristics, in copy synthesis. 
Two manipulation tasks were carried out using purpose-built interfaces developed for these experiments. Experiment 4 involved participants modifying an utterance so that it sounded like an appropriate response to a given question by moving sliders that controlled the Rd parameter. Experiment 5 involved participants manipulating parameters to make an utterance sound like it was being spoken by a particular speaker in a particular affective state. The responses were then used to modify the default parameters generated by the DNN-based speech synthesis system, by multiplying them by a scaling factor, to create a set of stimuli. These stimuli were used in the listening test of Experiment 6, where participants were asked to identify the speaker, the emotion of the speaker, and rate the magnitude of the emotion and naturalness of the utterance on five-point scales. Although participants identified sad stimuli successfully, this was not the case for happy stimuli. It is likely that additional modifications of the vocal tract and f0 contour are needed to improve identification rates. The GlórCáil system experiments reported here are seen as a contribution not only towards better control of the voice quality dimension of prosody in speech synthesis, but also towards research methodologies that will enhance our understanding of this vital dimension of human communication.
... Other spectral measures discussed in the voice quality literature include the difference in amplitude between the second and fourth harmonics (H2-H4), for measuring pathological voice quality (Kreiman, Gerratt & Antoñanzas-Barroso 2006), the average of H1-H2 compared to A1, for measuring non-contrastive voice quality in English (Stevens 1988), and formant amplitude differences such as A2-A3 in English (Klatt & Klatt 1990). These are, however, not widely used in studies of linguistically contrastive voice quality. ...
Article
Full-text available
Gujarati and White Hmong are among a small handful of languages known to maintain a phonemic contrast between breathy and modal voice across both obstruents and vowels. Given that breathiness on stop consonants is realized as a breathy-voiced aspirated release into the following vowel, how is consonant breathiness distinguished from vocalic breathiness, if at all? We examine acoustic and electroglottographic data of potentially ambiguous CV sequences collected from speakers of Gujarati and White Hmong, to determine what properties reliably distinguish breathiness associated with stop consonants from breathiness associated with vowels comparing both within and across these two unrelated languages. Results from the two languages are strikingly similar: only the early timing and increased magnitude of the various acoustic reflexes of breathiness phonetically distinguish phonemic consonantal breathiness from phonemic vocalic breathiness.
... Much of the established literature on phonation is based on studies of non-phonological voice quality differences. These studies can be divided into two general categories: (1) studies of pathologically-disordered phonation (Childers & Lee, 1991; Kreiman, Gerratt, & Antoñanzas-Barroso, 2006; Lieberman, 1963) and (2) purely phonetic studies of non-modal phonation in English (Hanson, 1995, 1997; Hanson & Chuang, 1999; Hillenbrand, Cleveland, & Erickson, 1994; Iseli, Shue, & Alwan, 2007; Klatt & Klatt, 1990). However, phonological uses of phonation within languages-both allophonic and contrastive-are arguably distinct (Blankenship, 2002; Hillenbrand et al., 1994, p. 777); like other areas of phonology, contrastive and allophonic voice quality can vary by localization (i.e. ...
... Each participant first recorded the syllable /ha/. A steady-state section of the vowel was inverse-filtered and re-synthesized with different levels of signal-to-noise ratio (SNR) using software developed by the Bureau of Glottal Affairs at UCLA (Kreiman, Gerratt & Antoñanzas-Barroso, 2006). SNR ranged from -20 to -11 dB. ...
Article
Full-text available
Adaptive learning of speech behavior has been demonstrated in the areas of vowel height and backness [Houde (1998); Guenther (2006)] and pitch [Larson (1998)]. In each of these studies, the acoustic feedback presented to a speaker was perturbed, and most speakers modified their speech behavior to compensate. Adaptive learning is distinguished from feedback control if the compensatory behavior persists briefly once the perturbation is removed. The aim of the current study is to examine compensatory vocal behavior when voice quality is artificially perturbed. The common voice disorder, muscle tension dysphonia (MTD), in which a patient is hoarse in the absence of any physiological impairments, may be triggered by this type of compensatory behavior such as adaptation during an upper‐respiratory infection. In this experiment, the perceived breathiness of speakers without vocal pathology will be artificially increased by mixing speech‐shaped noise into the vocal feedback presented to the speaker. In pilot data with one speaker, when noise was added, spectral slant (H2‐H1) increased. The increase in spectral slant indicates an increase in the closed phase of phonation, compensating for the perceived breathiness. Implications of these data will be discussed relative to the onset and maintenance of MTD.
... The purpose of this was to test both systems' ability to deal with the distortions imposed on recorded signals by one particular recording setup. In total 120 test signals were automatically parameterised using the new spectral approach described in this paper and a standard time-based LF model parameterisation tool (an implementation of the algorithm described in [12], which is commonly used in other voice source analysis tools [13, 14]). Parameter values were analysed using two measures: relative change (RC) and Wilk's coefficient of variation (CV). ...
Conference Paper
Full-text available
This paper presents a new method of extracting LF model based parameters using a spectral model matching approach. Strategies are described for overcoming some of the known difficulties of this type of approach, in particular high frequency noise. The new method performed well compared to a typical time based method particularly in terms of robustness against distortions introduced by the recording system and in terms of the ability of parameters extracted in this manner to differentiate three discrete voice qualities. Results from this study are very promising for the new method and offer a way of extracting a set of non-redundant spectral parameters that may be very useful in both recognition and synthesis systems. Index Terms: LF model, voice source, parameterisation, robustness, classification.
Article
Full-text available
Gujarati is known for distinguishing breathy and modal phonation in both consonants and vowels [(Cardona and Suthar, 2003)], e.g., ba:r "twelve" versus Ba:r "burden" versus bA:r "outside," where uppercase represents breathiness. The current study investigates acoustic and glottographic properties of this three-way contrast. [Fischer-Jorgensen (1967)] and [Bickley (1982)] showed that the H1-H2 measure can reliably distinguish breathy and modal vowels, and [Esposito (2006)] further found that Gujarati listeners exclusively attend to H1-H2 to categorize audio samples from other languages, even when H1-H2 did not reliably distinguish contrastive voice quality in those languages. However, do other acoustic properties also distinguish breathiness in Gujarati? In addition what physiological differences underlie the acoustic differences? The current study tests the reliability of several phonetic measures (e.g., H1-H2, H1-A1, CPP, CQ, etc.) to distinguish 33 (near-) minimal sets contrasting breathy and modal segments in audio and electroglottographic data produced by both male and female Gujarati speakers. [Work supported by NSF].
... Each sample was copied using a custom formant synthesizer optimized for precisely modeling pathologic voice quality. Analysis and synthesis procedures are described in detail elsewhere (Kreiman et al., 2006). Briefly, the synthesizer sampling rate was fixed at 10 kHz. ...
Article
Full-text available
Modeling sources of listener variability in voice quality assessment is the first step in developing reliable, valid protocols for measuring quality, and provides insight into the reasons that listeners disagree in their quality assessments. This study examined the adequacy of one such model by quantifying the contributions of four factors to interrater variability: instability of listeners' internal standards for different qualities, difficulties isolating individual attributes in voice patterns, scale resolution, and the magnitude of the attribute being measured. One hundred twenty listeners in six experiments assessed vocal quality in tasks that differed in scale resolution, in the presence/absence of comparison stimuli, and in the extent to which the comparison stimuli (if present) matched the target voices. These factors accounted for 84.2% of the variance in the likelihood that listeners would agree exactly in their assessments. Providing listeners with comparison stimuli that matched the target voices doubled the likelihood that they would agree exactly. Listeners also agreed significantly better when assessing quality on continuous versus six-point scales. These results indicate that interrater variability is an issue of task design, not of listener unreliability.
... Each natural voice sample was copied using a custom formant synthesizer implemented in MATLAB (Mathworks, 2002). Analysis and synthesis procedures are described in detail elsewhere (Gerratt & Kreiman, 2001; Kreiman, Gerratt, & Antoñanzas-Barroso, 2006). Briefly, the synthesizer sampling rate was fixed at 10 kHz. Parameters describing the harmonic part of the glottal source were estimated by inverse filtering a representative cycle of phonation for each voice using the method described by Javkin et al. (1987). ...
Article
Full-text available
Many researchers have studied the acoustics, physiology, and perceptual characteristics of the voice source, but despite significant attention, it remains unclear which aspects of the source should be quantified and how measurements should be made. In this study, the authors examined the relationships among a number of existing measures of the glottal source spectrum, along with the association of these measures to overall spectral shapes and to glottal pulse shapes, to determine which measures of the source best capture information about the shapes of glottal pulses and glottal source spectra. Seventy-eight different measures of source spectral shapes were made on the voices of 70 speakers. Principal components analysis was applied to measurement data, and the resulting factors were compared with factors similarly derived from oral speech spectra and glottal pulses. Results revealed high levels of duplication and overlap among existing measures of source spectral slope. Further, existing measures were not well aligned with patterns of spectral variability. In particular, existing spectral measures do not appear to model the higher frequency parts of the source spectrum adequately. The failure of existing measures to adequately quantify spectral variability may explain why results of studies examining the perceptual importance of spectral slope have not produced consistent results. Because variability in the speech signal is often perceptually salient, these results suggest that most existing measures of source spectral slope are unlikely to be good predictors of voice quality.