Figure 3 - uploaded by Daniel Hirst
Coding an F0 rise using a quadratic spline function. 

Citations

... intonation initially proposed by Hirst and Espesser (1993), Hirst and Di Cristo (1998), Hirst et al. (2000) and Hirst (2005) in which whole melodies are relevant in that they can express various functions of intonation, such as the grammatical one. INTSINT was initially devised to provide a narrower, but at the same time language non-specific, transcription of intonation that could be applied especially in intonation typology, very close to an IPA-like type of intonation transcription, unlike ToBI, which is compared by Hirst and Di Cristo (1998). Pitch movement is represented using the Momel system as a quadratic spline function in which the pitch symbols are used to define pitch points or targets (Hirst and Di Cristo 1998: 15). The main criticism of the system is that the transcription fails to account for more refined details of intonation: "If the main advantage pertains to the elimination of most of the micro-melodic factors in the fundamental frequency curve, the benefits are not clear compared to raw F0 data" (Martin 2015: 41). ...
Thesis
Full-text available
Within the field of Romance phonetics and phonology, the intonation of the Daco-Romance languages (Romanian, Aromanian, Megleno-Romanian and Istro-Romanian) has been a much-neglected topic. In fact, until relatively recently, little was known about the general importance of intonation in speech and about its forms and functions. Intonation in Daco-Romance was investigated only marginally, usually in mainstream Romanian grammar compendia, which doomed it to be a virtually unstudied area. Although there are several short descriptions of Romanian intonation (Dascălu-Jinga 1971, 1998, 2001; Vasiliu 1965; Chițoran, Pârlog and Augerot 1984; Chițoran 2002) they were not conducted in any particular framework and were mainly impressionistic in character. It is apparent that a fresh comprehensive approach to intonation in Romanian and in Eastern Romance in general is needed as a basis for future pedagogical, typological, and comparative research. After a critical account of major intonation theories – the IPO theory, the ‘traditional British’ system and the Autosegmental-Metrical (AM) theory – it is argued that the most suitable framework in which this project should be conducted is the AM theory. The main aim of the present thesis is to propose a comprehensive model for intonation in Romanian and the other Daco-Romance varieties based on the Autosegmental-Metrical theory (Pierrehumbert 1980, Ladd 2008 [1996], Gussenhoven 2004). This will involve the first Romanian ToBI (Ro-ToBI) transcription of intonation and show how focus is realised in the language. After providing an inventory of pitch accents and boundary tones, special attention is given to broad focus and narrow/contrastive focus in yes-no questions and wh-questions, which were reported to be peculiar in Romanian intonation compared with other (Western) Romance languages (Ladd 2008).
For this purpose, 12 native speakers of all four Daco-Romance varieties were interviewed, which resulted in a spontaneous corpus (short conversations or short stories), and a semi-spontaneous corpus (questionnaires specially designed to elicit broad, narrow and contrastive focus, as well as other specific types of intonation). Acoustic analyses were performed in PRAAT followed by a comparative study of Daco-Romanian, Aromanian, Megleno-Romanian, and Istro-Romanian. In order to facilitate research and comparative studies across Romance languages, the data presented in this thesis was obtained using two intonation questionnaires based on the Discourse Completion Test (initially developed by Blum-Kulka et al. 1989) which included some 31 situations designed to elicit a large number of specific sentence types and pragmatic meanings and eight different focus contexts. An analysis of the intonational phonology of Daco-Romance varieties suggests that they tend to align more with each other than with the non-Romance languages with which they are in contact. With respect to focus, the findings presented here suggest that the Nuclear Stress Rule (NSR) (Zubizarreta 1998; 2010) applies in Eastern Romance only to a certain extent in broad focus contexts, but not in narrow focus which allows contextual de-accenting. The results presented showed that Daco-Romance has a very rich and diverse intonational phonology as a bridge prosodic system between Slavic and Romance. The outcome of the project will not only have applications for automatic speech recognition (TTS systems) but will also help us to better understand intonational phonology in Romance in general.
... Using prosodic information, several language processing tasks can be performed, such as speaker recognition, phrase/sentence segmentation and tagging, dialog act segmentation and tagging, and disfluency detection. A prosody model is created on the basis of these speech features and is then used to build an outcome detection system [19][20][21]. ...
... Implementation of the same is done with Praat's software, which is autocorrelation based. The proposed work has taken numerous forms of slope-revealing tasks, so the required prosodic speech features are extracted from the particular word preceding a boundary and from each word following a boundary [20] [22]. Fig. 2 shows the procedure for extracting prosodic features from the speech signal. ...
... The Burg algorithm is used for formant prediction with a 10 ms time step, a 25 ms window width, and a 5500 Hz maximum formant frequency. We have implemented prosodic feature extraction using MATLAB and Praat software [20] [25]. ...
Article
Speaker recognition is a biometric sensory system that uses the human voice for the recognition process. Because such systems are used in security-sensitive applications (e.g. voice-based banking, access control, and crime investigation), performance improvement is a crucial factor. The authors of this manuscript propose a Speaker Recognition System (SRS) framework for developing speaker recognition systems. The proposed framework has six main phases: speech acquisition, feature extraction, speaker modeling, pattern matching, decision, and performance evaluation. The framework is implemented using prosodic features. The major reason for using prosodic features for speaker identification is that these features improve system performance and consistency. Prosodic features are robust against noise and channel effects. Training and testing databases have been created using the enrolled speakers' voices. Experiments are performed on the created voice databases of male and female utterances. The results indicate an improved recognition rate for the extracted prosodic features, ranging from 95.74% to 94.61%.
... These are used in previous work (e.g., [15]) and are shown to be useful for various tasks. For estimating values of slope changes, we used the Momel algorithm [16] to reconstruct the pitch values of the unvoiced segment and then estimated pitch slope features. ...
Conference Paper
Accurate prominence annotation benefits many spoken language understanding tasks as well as speech synthesis. In this work, we conduct a thorough study using acoustic prosodic cues for prominence detection in speech. This study is different from previous work in several aspects. In addition to the widely used prosodic features, such as pitch, energy, and duration, we introduce the use of cepstral features. Furthermore, we evaluate the effect of different features, speaker dependency and variation, different classifiers, and contextual information. Our experiments on the Boston University Radio News Corpus show that although the cepstral features alone do not perform well, when combined with prosodic features they yield some performance gain and, more importantly, can reduce much of the speaker variation in this task. We find that the previous context is more informative than the following context, and their combination achieves the best performance. The final result using selected features with context information is significantly better than that in previous work.
... Next, Section 2 presents our prosody model and describes the metrics used in our state-of-the-art QBH system. The corpus used in the experiment is presented in Section 3, together with an evaluation based on correlation and EER against human judges, which is followed by the conclusion in Section 4. Fig. 1 illustrates the flowchart of the proposed algorithm. Pitch and recognition results of speech are first acquired by our automatic speech recognition engine; error correction and stylization of pitch are then applied to remove micro-prosodic disturbance and hypothesize the pitch level of voiceless stretches of speech, followed by Momel stylization [4] and the Fujisaki model extractor [5] of pitch on both training and test data. After that, the DTW and EMD metrics in QBH are applied to the stylized pitch curve and the Fujisaki model, respectively. ...
... The rationale for stylization is that, when evaluating the goodness of rhythm, listeners tend to attend intuitively to the overall pitch tendency and rhythmic skeleton; they seem to ignore unvoiced speech, and perception unconsciously bridges the silent gap by filling in the missing part of the pitch contour. A popular pitch stylization method is MOMEL (modeling melody) by Hirst [4], a micro-prosody filter shown to be better than simple interpolation. It rests on the assumption that the melodic curve can be approximated piecewise by a best-fitting second-degree polynomial. ...
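The snippets above mention comparing stylized pitch curves with DTW (dynamic time warping). As a rough illustration only, the following is a textbook DTW distance between two pitch sequences. The function name `dtw_distance`, the absolute-difference local cost, and the unconstrained warping path are assumptions made for this sketch; the cited paper does not specify its configuration.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences.

    Assumptions (not taken from the cited work): local cost is the absolute
    difference, and the warping path is unconstrained (no window or slope limit).
    """
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

Two contours with the same shape but different timing, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, align at zero cost, which is why a warping-based measure is a natural fit for comparing stylized pitch curves of different durations.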
... The effectiveness of MSD-HMM was demonstrated on the recognition of Mandarin read speech and noisy speech [1] [13]. Here we extend it to spontaneous speech and compare the performance with conventional interpolation methods [6][7]. ...
Conference Paper
In this paper, we present a comparative study between spontaneous speech and read Mandarin speech in the context of automatic speech recognition. We focus on analysis and modeling of prosodic features, based on a unique speech corpus that contains similar amounts of read and spontaneous speech data from the same group of speakers. Statistical analysis is carried out on tone contours and duration of syllable and subsyllable units. Speech recognition experiments are performed to evaluate the effectiveness of different approaches to incorporate prosodic features into acoustic modeling. A key problem being addressed is how to deal with the unvoiced frames where F0 values are unavailable. We apply the technique of Multispace distribution (MSD) to model partially continuous F0 contours. For spontaneous speech, the tonal-syllable error rate is reduced from the MFCC baseline of 64.8% to 59.4% with the MSD based prosody model. For read speech, the performance improves from 46.0% to 36.4%.
... They are transcribed in various symbols: the Korean alphabet, a Romanized transcription, IPA and SAMPA. For the prosodic annotation, an automatic algorithm of pitch stylization and a prosodic annotation system, Momel and INTSINT, are used [3, 4, 5]. By using Momel [3, 4], the pitch targets for the original and stylized F0 values are extracted. ...
... For the prosodic annotation, an automatic algorithm of pitch stylization and a prosodic annotation system, Momel and INTSINT, are used [3, 4, 5]. By using Momel [3, 4], the pitch targets for the original and stylized F0 values are extracted. Then the stylized curves are manually corrected, so that the F0 values of the hand-corrected pitch targets are provided along with those extracted by Momel in the corpus. ...
... The MOMEL (MOdeling MELody) algorithm proposes a method of automatic stylization of F0 as a sequence of target points by means of a quadratic spline function [3, 4]. Given that F0 variations are considered as the superposition of two components, a microprosodic component, corresponding to local variations of pitch caused by the phonetic nature of the speech segments, and a macroprosodic component corresponding to the overall pitch pattern of the utterance, the Momel algorithm makes it possible to represent the macroprosodic component as a sequence of pitch targets. ...
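To make the idea of a quadratic spline through target points concrete, here is a minimal sketch of how an F0 contour could be regenerated from a sequence of (time, Hz) targets. It assumes a common reading of the quadratic spline: each transition is built from two parabolas that meet at the temporal midpoint, with zero slope at each target. The function names `quadratic_transition` and `sample_contour` and the 10 ms sampling step are hypothetical, and the sketch covers only the interpolation step, not Momel's target detection.

```python
def quadratic_transition(t, t1, h1, t2, h2):
    """F0 at time t on a two-parabola transition from (t1, h1) to (t2, h2).

    Assumption: the spline knot lies at the midpoint between the targets and
    the slope is zero at both targets, so the curve is smooth at every point.
    """
    if not t1 < t2:
        raise ValueError("t1 must be earlier than t2")
    if t <= t1:
        return h1
    if t >= t2:
        return h2
    x = (t - t1) / (t2 - t1)  # normalized position in [0, 1]
    if x <= 0.5:
        return h1 + 2.0 * (h2 - h1) * x * x
    return h2 - 2.0 * (h2 - h1) * (1.0 - x) ** 2


def sample_contour(targets, step_ms=10.0):
    """Sample a contour through (time_ms, hz) targets sorted by time."""
    if len(targets) < 2:
        raise ValueError("need at least two targets")
    points = []
    t = targets[0][0]
    seg = 0  # index of the target that starts the current transition
    while t <= targets[-1][0]:
        # Move to the transition that contains t.
        while seg < len(targets) - 2 and t > targets[seg + 1][0]:
            seg += 1
        (t1, h1), (t2, h2) = targets[seg], targets[seg + 1]
        points.append((t, quadratic_transition(t, t1, h1, t2, h2)))
        t += step_ms
    return points
```

For a rise from 100 Hz at 0 ms to 140 Hz at 200 ms, the curve passes through the mean of the two targets (120 Hz) at the midpoint, accelerating in the first half and decelerating in the second.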
Conference Paper
Full-text available
This paper describes the contents of the Korean prosody corpus (Korean MULTEXT), which is a Korean version of the speech database Eurom1. The corpus consists of about 2 hours of read speech, transcribed primarily in orthography (in Korean alphabet and in a Romanized transcription), in IPA and in SAMPA. Furthermore, it includes the original F0 values, stylized F0 values extracted using Momel, and hand-corrected F0 values. The prosodic events are annotated in two ways. They are annotated with the automatic annotation algorithm, INTSINT, and also labeled manually into prosodic units with two tones on the hand-corrected pitch targets. It is found that the resulting tone patterns from the proposed Momel-based two tone labeling correspond to those defined in K-ToBI.
... The tag set is: M (medium), T (top), B (bottom), H (higher), L (lower), U (up-step), D (down-step) and S (same). Tags are computed automatically by the INTSINT tool using the MOMEL algorithm [9], and the MES software tool [10]. MOMEL provides a default stylized F0 contour; then a perceptual verification task is performed by human annotators. ...
Article
Full-text available
In this paper a methodology and preliminary results of a machine learning experiment for correlating intonation patterns and speaker information with dialogue acts are presented. The goal of this work is to assess the extent to which prosodic and speaker data can help to identify obligation dialogue acts within a specific practical-dialogues audio and video corpus in Mexican Spanish. The machine learning method is decision trees. Current results show that the presented methodology is useful to the prediction of dialogue acts for the construction of conversational systems.
... The discontinuity in F0 contour between voiced and unvoiced transition is one reason why building a succinct F0 model is not so straightforward. Many ad hoc approaches like interpolating F0 in unvoiced segments to get around the problem have been proposed [1][2][3][4]. The interpolated F0 can be generated from a quadratic spline function [1], an exponential decay function towards the running F0 average [2], or a probability density function (pdf) with a large variance [3][4]. ...
... Many ad hoc approaches like interpolating F0 in unvoiced segments to get around the problem have been proposed [1][2][3][4]. The interpolated F0 can be generated from a quadratic spline function [1], an exponential decay function towards the running F0 average [2], or a probability density function (pdf) with a large variance [3][4]. These approaches are instrumentally effective to incorporate F0 as extra information with other short-time spectral features frame synchronously. ...
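One of the interpolation strategies named above, exponential decay towards the running F0 average, can be sketched as follows. This is an illustrative reading only: the function name `fill_unvoiced`, the per-frame `decay` constant, the use of `None` for unvoiced frames, and the choice of a cumulative mean of past voiced frames as the "running average" are all assumptions, not details from the cited work.

```python
import math


def fill_unvoiced(f0, decay=0.5):
    """Fill unvoiced frames (None) by decaying from the last voiced F0 value
    towards the running mean of all voiced frames seen so far.

    Assumptions: `decay` is a hypothetical per-frame rate, and the running
    average is a cumulative mean. Leading unvoiced frames stay None because
    no voiced context exists yet.
    """
    filled = []
    voiced_sum = 0.0
    voiced_count = 0
    last_voiced = None
    gap = 0  # number of consecutive unvoiced frames so far
    for value in f0:
        if value is not None:
            voiced_sum += value
            voiced_count += 1
            last_voiced = float(value)
            gap = 0
            filled.append(last_voiced)
        elif voiced_count == 0:
            filled.append(None)
        else:
            gap += 1
            mean = voiced_sum / voiced_count
            filled.append(mean + (last_voiced - mean) * math.exp(-decay * gap))
    return filled
```

With the input `[100.0, 120.0, None, None, 110.0]`, the two unvoiced frames fall from 120 Hz towards the running mean of 110 Hz, and voiced frames are returned unchanged.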
... Target phrases were excerpted from recordings and digitized at 16 kHz. The fundamental frequency (F0) of each excerpt was modeled with a quadratic spline function using an automatic modeling algorithm, MOMEL (Hirst & Espesser, 1993) with manual corrections. The modeled F0 is represented by a sequence of <ms, Hz> target points corresponding to relevant local F0 turning points, as illustrated in Figure 1. ...
Article
Full-text available
In addition to the phrase-final accent (FA), the French phonological system includes a phonetically distinct Initial Accent (IA). The present study tested two proposals: that IA marks the onset of phonological phrases, and that it has an independent rhythmic function. Eight adult native speakers of French were instructed to read syntactically ambiguous French sentences (e.g., Les gants et les bas lisses 'the smooth gloves and stockings') in a way that disambiguated the scope of the adjective. When the final adjective (lisses) applies to the conjoined NP, a prosodic boundary is warranted immediately before the adjective; when it applies to the second NP alone, a boundary before that NP is more appropriate. Length of the second noun and the adjective were varied from one to four syllables to investigate length-related tendencies toward phonological boundary marking and toward rhythmic placement of IA. For the materials from six speakers whose readings were correctly interpreted by native listeners, incidence of word-initial prosodic peaks was affected by both structure and length, with most reliable occurrence at onsets of Minor/Phonological Phrases. The only effect of rhythmicity independent of phrase structure was omission of FA in stress clash with IA.
... The discontinuity between voiced and unvoiced segments has traditionally made tone modeling difficult. Many ad hoc approaches have been proposed to interpolate F0 in unvoiced segments to bypass the discontinuity problem [1][2][3][4]. The interpolated F0s are generated from a quadratic spline function [1], an exponential decay function towards the running F0 average [2], or a probability density function (pdf) with a very large variance [3][4]. ...
... Many ad hoc approaches have been proposed to interpolate F0 in unvoiced segments to bypass the discontinuity problem [1][2][3][4]. The interpolated F0s are generated from a quadratic spline function [1], an exponential decay function towards the running F0 average [2], or a probability density function (pdf) with a very large variance [3][4]. Despite their heuristic nature, these approaches are reasonably effective in incorporating F0 as extra components in the short-time acoustic features. ...