Table 3
Frequency of silent, breath, and filled pauses in recordings: mean (standard deviation)


Source publication
Article
Full-text available
This article describes statistics of pauses in Polish speech as a potential source of biometry information for automatic speaker recognition. The usage of three main types of acoustic pauses (silent, filled, and breath pauses) and syntactic pauses (punctuation marks in speech transcripts) was investigated quantitatively in three types of spontaneous spee...

Context in source publication

Context 1
... of filled pauses in a minute of recordings was often surprisingly high, especially for inexperienced speakers (even above 10 per minute). Mean frequencies of the different types of pauses are compared in Table 3. ...
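For readers who want to reproduce figures of this kind, the sketch below shows one way to derive per-minute pause frequencies and their mean (standard deviation), as reported in Table 3, from pause annotations. The annotation layout, field names, and values are illustrative assumptions, not the authors' actual data or tooling.

```python
import statistics

# Hypothetical annotations: per recording, a duration and a list of
# (pause_type, start_s, end_s) tuples.
recordings = [
    {"duration_s": 180.0,
     "pauses": [("silent", 2.1, 2.6), ("filled", 10.4, 10.9), ("breath", 15.0, 15.3)]},
    {"duration_s": 240.0,
     "pauses": [("silent", 1.0, 1.7), ("silent", 30.2, 30.8), ("filled", 55.1, 55.6)]},
]

def per_minute_frequency(rec, pause_type):
    """Number of pauses of a given type per minute of recording."""
    count = sum(1 for p in rec["pauses"] if p[0] == pause_type)
    return count / (rec["duration_s"] / 60.0)

for pause_type in ("silent", "filled", "breath"):
    freqs = [per_minute_frequency(r, pause_type) for r in recordings]
    mean = statistics.mean(freqs)
    sd = statistics.stdev(freqs) if len(freqs) > 1 else 0.0
    print(f"{pause_type}: {mean:.2f} ({sd:.2f}) per minute")
```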

Similar publications

Article
Full-text available
The performance of automatic speaker verification (ASV) systems degrades as the amount of speech used for enrollment and verification is reduced. Combining multiple systems based on different features and classifiers considerably reduces the speaker verification error rate for short utterances. This work attempts to incorporate supplemen...
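The abstract does not specify the combination method; a common baseline for combining multiple ASV systems is score-level fusion, a weighted sum of the subsystems' per-trial scores. A minimal sketch, with weights and scores purely illustrative (in practice the weights would be tuned on a development set):

```python
import numpy as np

def fuse_scores(score_lists, weights):
    """Weighted-sum score-level fusion of several ASV subsystems.

    score_lists: one array of per-trial scores per subsystem.
    weights: illustrative here; normally calibrated on held-out data.
    """
    scores = np.stack(score_lists)        # shape: (n_systems, n_trials)
    w = np.asarray(weights)[:, None]
    return (w * scores).sum(axis=0)

# Illustrative scores from an MFCC-based and a prosody-based subsystem.
mfcc_scores = np.array([1.2, -0.3, 0.8])
prosody_scores = np.array([0.9, -0.1, 0.2])
print(fuse_scores([mfcc_scores, prosody_scores], weights=[0.7, 0.3]))
```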
Chapter
Full-text available
The biometric recognition of humans through the speech signal is known as automatic speaker recognition (ASR) or voice biometric recognition. Plenty of acoustic features have been used in ASR so far, but among them Mel-frequency cepstral coefficients (MFCCs) and Gammatone frequency cepstral coefficients (GFCCs) are the most popularly used. To make ASR langua...
Conference Paper
Full-text available
Automatic speaker recognition (ASR) is a type of biometric recognition of humans, known as voice biometric recognition. Among the many acoustic features, Mel-frequency Cepstral Coefficients (MFCCs) and Gammatone Frequency Cepstral Coefficients (GFCCs) are the most popularly used in ASR. The state-of-the-art techniques for modeling/classification(s) are V...
Conference Paper
Full-text available
In this paper, a Forensic Automatic Speaker Recognition System (FASRS) is implemented to properly identify and authenticate a suspect in a simulation of a police or legal investigation, using telephone recordings from a universal database in clean and noisy environments. In our case, we used the Bayesian interpretation to compute the Likelihood Rati...
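As background for the Bayesian interpretation mentioned above: a forensic likelihood ratio compares how probable the observed comparison score is under the same-speaker hypothesis versus the different-speaker hypothesis. A toy sketch under Gaussian score-model assumptions (all distribution parameters are illustrative):

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def likelihood_ratio(score, same_mu, same_sigma, diff_mu, diff_sigma):
    """LR = p(score | same speaker) / p(score | different speakers)."""
    return (gaussian_pdf(score, same_mu, same_sigma)
            / gaussian_pdf(score, diff_mu, diff_sigma))

# Illustrative score distributions estimated from a background population.
lr = likelihood_ratio(score=2.0, same_mu=2.5, same_sigma=1.0,
                      diff_mu=-1.0, diff_sigma=1.0)
print(f"LR = {lr:.1f}")  # LR > 1 supports the same-speaker hypothesis
```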
Article
Full-text available
Recently, the increasing demand for voice-based authentication systems has encouraged researchers to investigate methods for verifying users with short randomized pass-phrases with constrained vocabulary. The conventional i-vector framework, which has been proven to be a state-of-the-art utterance-level feature extraction technique for speaker veri...

Citations

... (9) what is this feeling Second stanza. Lines (10) what is the name for (11) how did I get here (12) This quiet work, Third stanza. Lines (13) ...
Article
In free verse, the line functions as a rhythmic unit. The length of the lines, their spatial arrangement on the page, the number of lines per stanza, and the presence or absence of punctuation depend on the author's understanding of their semantic load. When free verse is read aloud, the pause, as a prosodic means of segmenting the speech stream into semantic units, is one of the indicators of how the work is interpreted. This article presents the results of a mini-project in which senior students of the Faculty of Foreign Languages at Vasyl Stefanyk Precarpathian National University took part as speakers and listeners. The aim of the project was to investigate the role of the pause in oral reader performances of the free-verse poem "The long hall" by the contemporary Irish-Canadian author Stevie Howell. The first audio recording was made before the participants analyzed the poem using the methodology of Text World Theory, which involves building mental representations from textual information and the readers' knowledge of the real world. Applying the principles of Text World Theory aided the interpretation of the poem's discourse, which explains the differences between the first and second recordings. Auditory analysis shows that constructing mental representations, the text worlds of the poem, changes the perspective from which the discourse is perceived. In the second recording, unlike the first, there is a noticeable shift of attention from the grammatical-syntactic to the semantic aspect of the work. In the second reading, this was reflected in the correlation of the pauses, their placement, character, and duration, with the author's distribution of the text across lines, which at times contradicts grammatical norms but creates the poem's distinctive rhythm, conveys the semantic load of each line, and heightens the emotional impact on the listener.
... • Speech Segmentation: a constant stream of 1-second chunks is extracted from the audio call during this stage. There is usually only one speaker in each speech segment [20]. • Embedding Extraction: using a neural network, this stage embeds the speech segments retrieved in the preceding phase. ...
Article
Full-text available
More than 1,500 different languages are spoken in India. Voice and language technology would be quite beneficial to most of them. Indian languages have unique features and similarities that must be exploited in order to provide voice recognition capability for these low-resource languages. A low-resource Automatic Speech Recognition system for Indian languages was developed with this objective in mind. Rapid progress has been achieved in speaker diarization across a variety of application domains. A historical assessment of speaker diarization technology, as well as contemporary advances in neural speaker diarization techniques, is presented in this study, integrating recent advancements in neural techniques. We believe that this study significantly contributes to the community by providing valuable insights and paving the way for advancements in the field of speaker diarization, ultimately leading to more efficient and accurate results.
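A minimal sketch of the segmentation and embedding-extraction stages quoted in the excerpt above, completed by the clustering step that typically follows in a diarization pipeline. The toy embed() function stands in for a trained neural embedding model (e.g., an x-vector network) and is an assumption made so the sketch runs end to end:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

SR = 16000       # sample rate (Hz); illustrative
CHUNK_S = 1.0    # 1-second segments, as in the excerpt

def segment(audio, sr=SR, chunk_s=CHUNK_S):
    """Speech segmentation stage: split the signal into fixed-length chunks."""
    step = int(sr * chunk_s)
    return [audio[i:i + step] for i in range(0, len(audio) - step + 1, step)]

def embed(chunk):
    """Stand-in for a trained neural speaker-embedding model: a crude
    log-energy band summary, used only so the sketch is runnable."""
    spectrum = np.abs(np.fft.rfft(chunk))
    bands = np.array_split(spectrum, 16)
    return np.log1p(np.array([b.mean() for b in bands]))

def diarize(audio, n_speakers=2):
    chunks = segment(audio)
    embeddings = np.stack([embed(c) for c in chunks])
    # Cluster embeddings so chunks from the same speaker share one label.
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(embeddings)
    return labels  # one speaker label per 1-second chunk

# Illustrative usage on random noise (real audio would come from a file).
print(diarize(np.random.randn(SR * 5)))
```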
... Although MFCCs are the most widely employed domain for SR, especially since the introduction of the concept of i-vectors, it is undeniable that many other features can define the peculiarities of a speaker's voice. F0 and its variations are deeply related to intonation, whereas the amount of short breathing pauses, evaluated by means of the Voicing Probability, has been shown to discriminate between speakers [32]. The importance of these features is confirmed by the results of the feature selection, which show that the number of selected features is only 3.45% of the original number. ...
Article
Full-text available
Speaker Recognition (SR) is a common task in AI-based sound analysis, involving structurally different methodologies such as Deep Learning or "traditional" Machine Learning (ML). In this paper, we compared and explored the two methodologies on the DEMoS dataset, consisting of 8869 audio files of 58 speakers in different emotional states. A custom CNN is compared to several pre-trained nets using image inputs of spectrograms and cepstral-temporal (MFCC) graphs. An ML approach based on acoustic feature extraction, selection and multi-class classification by means of a Naïve Bayes model is also considered. Results show how a custom, less deep CNN trained on grayscale spectrogram images obtains the most accurate results: 90.15% on grayscale spectrograms and 83.17% on colored MFCC. AlexNet provides comparable results, reaching 89.28% on spectrograms and 83.43% on MFCC. The Naïve Bayes classifier provides an 87.09% accuracy and a 0.985 average AUC while being faster to train and more interpretable. Feature selection shows how F0, MFCC and voicing-related features are the most characterizing for this SR task. The high number of training samples and the emotional content of the DEMoS dataset better reflect a real-case scenario for speaker recognition, and account for the generalization power of the models.
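The F0, voicing-probability, and MFCC features highlighted above can be extracted with standard tooling; below is a minimal sketch using librosa, where the file name, sample rate, and summary statistics are illustrative choices rather than the paper's exact pipeline:

```python
import librosa
import numpy as np

y, sr = librosa.load("speaker.wav", sr=16000)  # illustrative file

# MFCCs: the most widely used spectral representation for SR.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# F0 contour and per-frame voicing probability via the pYIN algorithm;
# f0 is NaN in unvoiced frames, hence the nan-aware summaries below.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Utterance-level summaries of the kind a feature-selection step might keep.
features = {
    "f0_mean": np.nanmean(f0),
    "f0_std": np.nanstd(f0),
    "voicing_prob_mean": np.mean(voiced_prob),
    "mfcc_mean": mfcc.mean(axis=1),
}
print(features)
```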
... Speech pauses in conversations have important communicative functions and can allow interlocutors to make inferences about the nature of the conversation. For example, interlocutors can infer communication style, the emotional state of the speaker or speaker identity by evaluating the pause characteristics of a conversation (e.g., Fletcher 2010; Igras-Cybulska et al. 2016; Duez 1982; Çokal et al. 2019; O'Connell and Kowal 1983; Lundholm-Fors 2015). Pauses also allow interlocutors to make inferences about their communication partners' cognitive state, knowledge and intentions. ...
Article
Full-text available
Speech pauses between turns of conversations are crucial for assessing conversation partners’ cognitive states, such as their knowledge, confidence and willingness to grant requests; in general, speakers making longer pauses are regarded as less apt and willing. However, it is unclear if the interpretation of pause length is mediated by the accent of interactants, in particular native versus non-native accents. We hypothesized that native listeners are more tolerant towards long pauses made by non-native speakers than those made by native speakers. This is because, in non-native speakers, long pauses might be the result of prolonged cognitive processing when planning an answer in a non-native language rather than of a lack of knowledge, confidence or willingness. Our experiment, in which 100 native Polish-speaking raters rated native and non-native speakers of Polish on their knowledge, confidence and willingness, showed that this hypothesis was confirmed for perceived willingness only; non-native speakers were regarded as equally willing to grant requests, irrespective of their inter-turn pause durations, whereas native speakers making long pauses were regarded as less willing than those making short pauses. For knowledge and confidence, we did not find a mediating effect of accent; both native and non-native speakers were rated as less knowledgeable and confident when making long pauses. One possible reason for the difference between our findings on perceived willingness to grant requests versus perceived knowledge and confidence is that requests might be more socially engaging and more directly relevant for interpersonal cooperative interactions than knowledge that reflects on partners’ competence but not cooperativeness. Overall, our study shows that (non-)native accents can influence which cognitive states are signaled by different pause durations, which may have important implications for intercultural communication settings where topics are negotiated between native and non-native speakers.
... Measuring the amount of silent pauses in human speech is quite common (see e.g. Mattys et al., 2005; Fraser et al., 2013; Igras-Cybulska et al., 2016; Al-Ghazali and Alrefaee, 2019; Sluis et al., 2020). The attribute set developed by our team, besides silent pauses, also summarizes the amount of filled pauses (i.e. ...
Article
Dementia is a chronic or progressive clinical syndrome, characterized by the deterioration of problem-solving skills, memory and language. In Mild Cognitive Impairment (MCI), which is often considered to be the prodromal stage of dementia, there is also a subtle deterioration of these cognitive functions; however, it does not affect the patients’ ability to carry out simple everyday activities. The timely identification of MCI could provide more effective therapeutic interventions to delay progression, and to postpone the possible conversion to dementia. Since language changes in MCI are present even before the manifestation of other distinctive cognitive symptoms, a non-invasive way of early automatic screening could be the use of speech analysis. Earlier, our research team developed a set of temporal speech parameters that mainly focus on the amount of silence and hesitation, and demonstrated its applicability for MCI detection. However, for the automatic extraction of these attributes, the execution of a full Automatic Speech Recognition (ASR) process is necessary. In this study we propose a simpler feature extraction approach, which still quantifies the amount of silence and hesitation in the speech of the subject, but does not require the application of a full ASR system. We experimentally demonstrate that this approach, operating directly on the frame-level output of a HMM/DNN hybrid acoustic model, is capable of extracting attributes as useful as the ASR-based temporal parameter extraction workflow was able to. That is, on our corpus consisting of 25 healthy controls, 25 MCI and 25 mild AD subjects, we achieve a (three-class) classification accuracy of 70.7%, an F-measure score of 89.6 and a mean AUC score of 0.804. We also show that this approach can be applied on simpler, context-independent acoustic states with only a slight degradation of MCI and mild Alzheimer’s detection performance. Lastly, we investigate the usefulness of the three speaker tasks which are present in our recording protocol.
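As an illustration of the frame-level approach described above, the sketch below summarizes the amount of silence and hesitation directly from per-frame acoustic-model labels, with no full ASR decode. The label names and frame rate are assumptions, not the authors' exact state inventory:

```python
FRAME_S = 0.01  # 10 ms frames, a common acoustic-model frame rate

def temporal_features(frame_labels):
    """Summarize silence and hesitation from per-frame acoustic-model output.

    frame_labels: sequence of strings, e.g. "sil", "filler", "speech"
    (hypothetical label set).
    """
    n = len(frame_labels)
    n_sil = sum(1 for label in frame_labels if label == "sil")
    n_fill = sum(1 for label in frame_labels if label == "filler")
    return {
        "duration_s": n * FRAME_S,
        "silence_ratio": n_sil / n,
        "hesitation_ratio": n_fill / n,
        "speech_ratio": (n - n_sil - n_fill) / n,
    }

# Illustrative usage: 7 s speech, 2 s silence, 1 s filled pauses.
print(temporal_features(["speech"] * 700 + ["sil"] * 200 + ["filler"] * 100))
```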
... Finally, it should be acknowledged that "silent pause duration" may be efficient for differentiating speaking styles. In the study conducted by [75] on Polish, on the application of pauses as a potential source of biometry for automatic speaker recognition, three types of acoustic pauses (silent, filled and breath pauses) and syntactic pauses were analyzed in both spontaneous and read speech. The researchers found that the quantity and duration of filled pauses, audible breaths, and the correlation between the temporal structure of speech and the syntactic structure were the best-performing features for speaker characterization. ...
Article
Full-text available
The purpose of this study was to assess the speaker-discriminatory potential of a set of speech timing parameters while probing their suitability for forensic speaker comparison applications. The recordings comprised spontaneous dialogues between twin pairs over mobile phones, while being directly recorded with professional headset microphones. Speaker comparisons were performed with twin speakers engaged in a dialogue (i.e., intra-twin pairs) and among all subjects (i.e., cross-twin pairs). The participants were 20 Brazilian Portuguese speakers, ten male identical twin pairs from the same dialectal area. A set of 11 speech timing parameters was extracted and analyzed, including speech rate, articulation rate, syllable duration (V-V unit), vowel duration, and pause duration. Three system performance estimates were considered for assessing the suitability of the parameters for speaker comparison purposes, namely global Cllr, EER, and AUC values. These were interpreted while also taking into consideration the analysis of effect sizes. Overall, speech rate and articulation rate were found to be the most reliable parameters, displaying the largest effect sizes for the factor "speaker" and the best system performance outcomes, namely the lowest Cllr and EER and the highest AUC values. Conversely, smaller effect sizes were found for the other parameters, which is compatible with a lower explanatory potential of speaker identity on the duration of such units and a possibly higher linguistic control over their temporal variation. In addition, there was a tendency for speech timing estimates based on larger temporal intervals to present larger effect sizes and better speaker-discriminatory performance. Finally, identical twin pairs were found to be remarkably similar in their speech temporal patterns at the macro and micro levels while engaging in a dialogue, resulting in poor system discriminatory performance. Possible underlying factors for such a striking convergence in identical twins' speech timing patterns are presented and discussed.
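For reference, the EER reported above is the operating point where the false-acceptance and false-rejection rates are equal. A minimal sketch of computing it from comparison scores with scikit-learn, on illustrative labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Illustrative comparison trials: 1 = same speaker, 0 = different speakers.
labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([2.1, 1.4, 0.2, 0.5, -0.8, -1.3])

fpr, tpr, _ = roc_curve(labels, scores)
fnr = 1 - tpr
# EER: the point where false-positive and false-negative rates cross.
idx = np.nanargmin(np.abs(fnr - fpr))
eer = (fpr[idx] + fnr[idx]) / 2
print(f"EER ≈ {eer:.3f}")
```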
... In these approaches, the content of a meeting is easily analyzed and sensitive data can be exploited. As some works have already shown [Kakita and Hiki, 2015], [Igras-Cybulska et al., 2016], silences, and more particularly non-speech segments, contain enough information to determine the structure of a meeting or even to characterize a speaker. This paper has the following structure: Section II presents the related work and Section III describes our approach. ...
... Studying silences can also help to characterize speakers. In [Igras-Cybulska et al., 2016], the authors propose to study the three main types of acoustic pauses (silent, filled and breath pauses), and syntactic pauses (punctuation marks in speech transcripts) for speaker recognition and speaker profile estimation. After a statistical study of the parameters they extracted, the authors showed that the quantity and duration of filled pauses, audible breaths, and the correlation between the temporal structure of speech and the syntactic structure of the utterances were the features that characterize speakers most. ...
... There are three types of speech pauses in spoken language: silent pauses, filled pauses, and breath pauses (Igras-Cybulska, Ziółko, Żelasko, & Witkowski, 2016). While filled pauses contain filler words such as "um," silent pauses contain no voice activity. ...
... Breath pauses, which can be detected by high-quality voice recording devices, are pauses taken when a speaker stops to inhale and exhale (Igras & Ziolko, 2013). Research shows that breath pauses are significant indicators of punctuation in spoken language and are natural signals of sentence and phrase borders (Igras-Cybulska et al., 2016;Igras & Ziolko, 2013). In some research works, breath pauses are a sub-type of silent pauses. ...
... In some research works, breath pauses are a sub-type of silent pauses. Although silent pauses can be indicative of problems in discourse and syntactic planning in spoken language (Rose, 2017), long silent pauses are often used as stylistic devices by professional speakers (Igras-Cybulska et al., 2016). Before a presentation, sentence, phrase, and topic boundaries in presentation slides and accompanying slide notes can be identified using natural language processing algorithms (Furui & Kawahara, 2008). ...
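As a concrete illustration of the pause categories discussed in these excerpts, the sketch below flags low-energy intervals as silent-pause candidates using short-time RMS energy. The file name and thresholds are illustrative, and separating breath pauses from plain silences would require an additional classifier, ideally on close-microphone recordings:

```python
import librosa
import numpy as np

HOP = 512
y, sr = librosa.load("speech.wav", sr=16000)  # illustrative file

# Short-time RMS energy; frames below a threshold are pause candidates.
rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=HOP)[0]
threshold = 0.1 * np.median(rms)              # illustrative threshold
is_pause = rms < threshold

# Merge consecutive low-energy frames into pause intervals.
pauses, start = [], None
for i, low in enumerate(is_pause):
    if low and start is None:
        start = i
    elif not low and start is not None:
        t0, t1 = start * HOP / sr, i * HOP / sr
        if t1 - t0 >= 0.15:                   # ignore very short energy dips
            pauses.append((t0, t1))
        start = None

# Breath pauses would need a further classifier over these intervals
# (e.g., spectral cues available in high-quality recordings).
print(pauses)
```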
Article
Most speakers experience public speaking anxiety in one form or another, ranging from slight nervousness to paralyzing fear and panic. Public speaking anxiety negatively affects the quality of most presentations, but few people seek out treatments for their anxiety. Real-time public speaking anxiety interventions that help presenters manage their anxiety during a presentation can be of use to many presenters. Motivated by research on deep breathing interventions for stress management, we conduct a study with eleven participants to explore designs for deep breathing interventions that can be used during a presentation in front of an audience. We developed and compared three prototypes that nudged presenters to perform different actions (deep breathing, pausing, or smiling and looking at the audience) during a presentation. We report on participants' experience interacting with the prototypes and our findings on how sensor-driven technologies can nudge presenters to perform deep breathing as a just-in-time technique for reducing anxiety. Our findings reveal that such interventions should be automatically triggered by the presenter's anxiety level, use visual prompts that do not obscure the presentation content, and be delivered at topic boundaries.
... Hesitation markers are understood in this study in a narrower sense than hesitation phenomena classified for the purposes of conversation analysis or foreign-language learning, in which they range from pauses and syllable lengthening through repeats and restarts to small words and editing expressions (see e.g., Bortfeld et al. 2001; Gilquin 2008). In this article, then, hesitation markers refer to non-lexical fillers in the form of meaningless strings of prolonged sounds (in Polish, typically assuming the form of the prolonged vowels 'yyy', 'eee' or 'mmm'; Igras-Cybulska et al. 2016). They are also referred to in the literature as 'filled pauses' or 'fillers' (Bortfeld et al. 2001). ...
Article
This article investigates the correlation between explicitation and increased cognitive load in simultaneous interpreting by trainee interpreters. It has been hypothesised, on the one hand, that certain explicitating shifts in simultaneous interpreting may be caused by increased cognitive load and may be performed in an attempt to mask processing problems; and, on the other, that performing explicitating shifts may lead to increased cognitive load and trigger processing problems. The study triangulates product analysis (manual comparison of source and target texts) with process analysis (retrospective protocols of the participants). In the product, the correlation between the occurrence of explicitating shifts and increased cognitive load is sought by identifying problem indicators in the form of three types of disfluency: hesitation markers, false starts and anomalous pauses exceeding two seconds (performance measure). Retrospective protocols are analysed in search of reports of explicitating shifts and/or increased cognitive load experienced and/or the cognitive effort expended (subjective measure). The product analysis shows a correlation between explicitating shifts and cognitive load at the level of 31%. Spearman's rank-order correlation coefficient (r = 0.48) indicates that there is a positive association between these two variables. This finding is further confirmed by 122 retrospective comments of the subjects in the study. Keywords: simultaneous interpreting, explicitation, cognitive load, cognitive effort, process research, retrospective protocols https://benjamins.com/catalog/intp.00051.gum
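The reported Spearman rank-order coefficient can be computed with scipy; a minimal sketch on illustrative counts (not the study's data):

```python
from scipy.stats import spearmanr

# Illustrative per-segment counts: explicitating shifts vs. disfluency indicators.
explicitations = [0, 1, 1, 2, 3, 3, 4, 5]
disfluencies = [1, 0, 2, 2, 2, 4, 3, 5]

rho, p_value = spearmanr(explicitations, disfluencies)
print(f"Spearman r = {rho:.2f}, p = {p_value:.3f}")
```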