Table 2
Word Error Rate with artificial reduction in F0.

Source publication
Article
Full-text available
With ageing, human voices undergo several changes, typically characterized by increased hoarseness and changes in articulation patterns. In this study, we have examined the effect of these changes on Automatic Speech Recognition (ASR) and found that the Word Error Rate (WER) on older voices is 10% absolute higher than that on adult voices. Subsequ...

Context in source publication

Context 1
... word error rates before and after the reduction in pitch are given in Table 2. The WER increases by 1.1% absolute to 33.2%; the increase is statistically significant (P < .001) ...
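For illustration, here is a minimal sketch of the kind of experiment this context describes: artificially lower the F0 of a recording, decode both versions with the same recognizer, and compare word error rates. librosa's pitch_shift and the jiwer library are stand-ins, not the tools the original study used; the file names, shift amount and transcripts are all hypothetical.

```python
# Sketch: simulate F0 reduction, then compare WER before/after.
import librosa
import soundfile as sf
import jiwer

y, sr = librosa.load("older_speaker.wav", sr=16000)  # hypothetical input file

# Shift pitch down (here by 2 semitones) to simulate a reduced F0.
y_low = librosa.effects.pitch_shift(y, sr=sr, n_steps=-2.0)
sf.write("older_speaker_lowF0.wav", y_low, sr)

# After decoding both files with the same ASR system, compare WERs.
reference    = "word error rates before and after reduction in pitch"
hyp_original = "word error rates before and after reduction in pitch"  # placeholder
hyp_reduced  = "word error rate before and after redaction in pitch"   # placeholder
print("WER original:  ", jiwer.wer(reference, hyp_original))
print("WER reduced F0:", jiwer.wer(reference, hyp_reduced))
```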

Citations

... (Adda-Decker and Lamel 2005; Garnerin, Rossato, and Besacier 2019) have pointed out the gender bias in ASR systems, which favors female speakers, when benchmarking ASR performance on English and French news broadcast datasets. (Vipperla, Renals, and Frankel 2010) have audited the impact of speaker age on ASR performance. Most of the existing studies fall under the purview of black-box audits (Sandvig et al. 2014) due to the lack of access to the model architecture and training data of ASRs supplied by commercial vendors (Koenecke et al. 2020; Tatman and Kasten 2017). ...
... Experience: (Vipperla, Renals, and Frankel 2010) have highlighted that the organs involved in an individual's speech production mechanism, such as the lungs, vocal cords and vocal cavities, are affected by age, which in turn affects the articulation of words. The speakers with the most experience are also expected to be the oldest, and vice versa. The differences in median WER between the least and most experienced speakers in both ASRs depict the inability of the ASR models to account for these phonetic variations. Similarly, the substitution errors, which show the highest disparity in both ASRs, point in the same direction. ...
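The substitution-error disparity mentioned in this context can be read directly off a word alignment. A minimal sketch using the jiwer library (an assumption, not the tool the cited audit used), with illustrative placeholder transcripts:

```python
# Sketch: break a WER measurement down into error types.
import jiwer

reference  = ["the model converges after ten epochs"]
hypothesis = ["the model convergence after ten epoch"]

out = jiwer.process_words(reference, hypothesis)
print("WER:          ", out.wer)
print("substitutions:", out.substitutions)
print("deletions:    ", out.deletions)
print("insertions:   ", out.insertions)
```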
Article
Automatic speech recognition (ASR) systems are designed to transcribe spoken language into written text and find utility in a variety of applications, including voice assistants and transcription services. However, it has been observed that state-of-the-art ASR systems that deliver impressive benchmark results struggle with speakers of certain regions or demographics due to variation in their speech properties. In this work, we describe the curation of a massive speech dataset of 8740 hours, consisting of ~9.8K technical lectures in the English language along with their transcripts, delivered by instructors representing various parts of Indian demography. The dataset is sourced from the very popular NPTEL MOOC platform. We use the curated dataset to measure the existing disparity in YouTube Automatic Captions and OpenAI Whisper model performance across the diverse demographic traits of speakers in India. While there exists disparity due to gender, native region, age and speech rate of speakers, disparity based on caste is non-existent. We also observe statistically significant disparity across the disciplines of the lectures. These results indicate the need for more inclusive and robust ASR systems and more representative datasets for disparity evaluation.
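A sketch of the disparity measurement this abstract describes: per-speaker WERs grouped by a demographic trait, followed by a nonparametric significance test. The CSV layout and column names are assumptions for illustration, not the actual dataset schema.

```python
# Sketch: group per-speaker WER by a demographic trait and test disparity.
import pandas as pd
import jiwer
from scipy.stats import mannwhitneyu

df = pd.read_csv("nptel_eval.csv")  # hypothetical columns: reference, hypothesis, gender

df["wer"] = [
    jiwer.wer(ref, hyp) for ref, hyp in zip(df["reference"], df["hypothesis"])
]

# Median WER per group.
print(df.groupby("gender")["wer"].median())

# Mann-Whitney U test between two groups' per-speaker WER distributions.
female = df.loc[df["gender"] == "F", "wer"]
male   = df.loc[df["gender"] == "M", "wer"]
stat, p = mannwhitneyu(female, male)
print(f"Mann-Whitney U = {stat:.1f}, p = {p:.4f}")
```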
... Speech disorders such as dysarthria are associated with neuro-motor conditions and are often found among elderly adults experiencing neurocognitive disorders such as Alzheimer's disease (AD). ASR technologies tailored to their needs not only improve their quality of life, but also support large-scale automatic early diagnosis of neurocognitive impairment [4, 8-12]. ...
Preprint
Automatic recognition of disordered and elderly speech remains a highly challenging task to date due to the difficulty of collecting such data in large quantities. This paper explores a series of approaches to integrate domain-adapted SSL pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition: a) input feature fusion between standard acoustic frontends and domain-adapted wav2vec2.0 speech representations; b) frame-level joint decoding of TDNN systems separately trained using standard acoustic features alone and with additional wav2vec2.0 features; and c) multi-pass decoding in which the TDNN/Conformer system outputs are rescored using domain-adapted wav2vec2.0 models. In addition, domain-adapted wav2vec2.0 representations are utilized in acoustic-to-articulatory (A2A) inversion to construct multi-modal dysarthric and elderly speech recognition systems. Experiments conducted on the UASpeech dysarthric and DementiaBank Pitt elderly speech corpora suggest that TDNN and Conformer ASR systems integrating domain-adapted wav2vec2.0 models consistently outperform the standalone wav2vec2.0 models, with statistically significant WER reductions of 8.22% and 3.43% absolute (26.71% and 15.88% relative) on the two tasks respectively. The lowest published WERs of 22.56% (52.53% on very low intelligibility, 39.09% on unseen words) and 18.17% are obtained on the UASpeech test set of 16 dysarthric speakers and the DementiaBank Pitt test set, respectively.
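A minimal sketch of approach (a), input feature fusion: concatenate a standard acoustic frontend (MFCCs here) with wav2vec2.0 frame-level representations. The checkpoint name, file name and the simple linear-interpolation frame alignment are assumptions for illustration; the paper's systems use domain-adapted models and their own frontends.

```python
# Sketch: fuse MFCC frames with wav2vec2.0 hidden states.
import torch
import torch.nn.functional as F
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

wav, sr = torchaudio.load("utt.wav")  # hypothetical 16 kHz mono recording
mfcc = torchaudio.transforms.MFCC(sample_rate=sr, n_mfcc=13)(wav)[0].T  # (T_mfcc, 13)

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
inputs = extractor(wav[0].numpy(), sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    w2v = model(**inputs).last_hidden_state[0]  # (T_w2v, 768)

# The two streams have different frame rates; crudely resample the
# wav2vec2.0 stream to the MFCC frame count, then concatenate.
w2v_aligned = F.interpolate(
    w2v.T.unsqueeze(0), size=mfcc.shape[0], mode="linear", align_corners=False
)[0].T
fused = torch.cat([mfcc, w2v_aligned], dim=-1)  # (T_mfcc, 13 + 768)
```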
... These age alterations can be problematic for speech technologies. For example, a decrease in the performance of automatic speech recognition (ASR) systems has been reported (Vipperla et al., 2010; Hämäläinen et al., 2014; Das et al., 2013), and several efforts have been made to study older speech (Pellegrini et al., 2013; Das et al., 2013) and to balance ASR performance across ages (Hämäläinen et al., 2014; Das et al., 2013). Directly or indirectly, these efforts can profit from more knowledge regarding the aging effects. ...
Article
Full-text available
Despite speech being inherently dynamic, most acoustic studies and age classification experiments have focused on static features (or on delta and delta-delta features) to characterize vowel acoustics. As such, knowledge regarding age effects in vowel acoustics is limited and does not consider the dynamic aspects. This study intends not only to understand the usefulness of dynamic features for classification tasks and the characterization of vowel acoustics, but also to analyze how age affects both the dynamic features and the classification of vowels. The performance of several age and vowel classification models was investigated on a dataset of oral vowels from 112 European Portuguese speakers (aged 35 to 97). Features consisted of the first 3 DCT coefficients (C0 to C2), and a set of 5 representative types of classifiers was used (Discriminant Analysis, Support Vector Machines, Naïve Bayes, Decision Tree and Ensemble). The accuracy results of the age classification experiments showed improvement with the addition of dynamic features for the several age divisions considered. This improvement was also noticeable in vowel classification, but of lower magnitude. Statistical tests established a connection between the dynamic features and age: vowel formant dynamics, particularly C1 of the first formant (F1), are affected by age. Some gender differences were found. Globally, the results tend to support the hypothesis that dynamic features of vowels carry important information about the speaker and are an interesting source of speaker-discriminating information.
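The dynamic features this abstract names are easy to reproduce: the first three DCT coefficients of a formant track. A short sketch using scipy (the F1 trajectory below is synthetic illustration data, not from the study):

```python
# Sketch: C0-C2 dynamic features of a vowel's F1 trajectory.
import numpy as np
from scipy.fft import dct

f1_track = np.array([310.0, 340.0, 395.0, 430.0, 445.0, 440.0, 420.0])  # Hz, synthetic
c = dct(f1_track, norm="ortho")
c0, c1, c2 = c[:3]
# Roughly: C0 reflects the mean formant level, C1 the overall slope,
# and C2 the curvature of the trajectory across the vowel.
print(c0, c1, c2)
```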
... Alzheimer's disease (AD) is the most frequent form of dementia found in aged people. Its characteristics include progressive degradation of memory, cognition, and motor skills, and consequently a decline in the speech and language skills of patients [1, 2]. Currently, there is no effective cure for AD [3], but an intervention approach applied in time can postpone its progression and reduce the negative impact on patients [4]. ...
Preprint
Full-text available
Early diagnosis of Alzheimer's disease (AD) is crucial in facilitating preventive care and delaying further progression. Speech-based automatic AD screening systems provide a non-intrusive and more scalable alternative to other clinical screening techniques. Textual embedding features produced by pre-trained language models (PLMs) such as BERT are widely used in such systems. However, PLM domain fine-tuning is commonly based on masked word or sentence prediction costs that are inconsistent with the back-end AD detection task. To this end, this paper investigates prompt-based fine-tuning of PLMs that consistently uses AD classification errors as the training objective. Disfluency features based on hesitation or pause filler token frequencies are further incorporated into prompt phrases during PLM fine-tuning. To exploit the complementarity between BERT or RoBERTa based PLMs that are either prompt-learning fine-tuned or optimized using conventional masked word or sentence prediction costs, decision-voting based system combination between them is further applied. The mean, standard deviation and maximum of accuracy scores over 15 experiment runs are adopted as performance measurements for the AD detection system. Mean detection accuracies of 84.20% (std 2.09%, best 87.5%) and 82.64% (std 4.0%, best 89.58%) were obtained using manual and ASR speech transcripts respectively on the ADReSS20 test set consisting of 48 elderly speakers.
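The decision-voting combination the abstract describes reduces to a majority vote over per-speaker classifications. A toy sketch (the prediction matrix is made up for illustration):

```python
# Sketch: decision voting across several AD detection systems.
import numpy as np

# Rows: systems (e.g., prompt-tuned BERT, MLM-tuned RoBERTa, ...);
# columns: test speakers; 1 = AD, 0 = control.
preds = np.array([
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 0],
])

# Majority vote across systems for each speaker.
final = (preds.sum(axis=0) > preds.shape[0] / 2).astype(int)
print(final)  # -> [1 0 1 1 0]
```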
... • Vocal range, e.g. fundamental frequency and pitch range [25]. ...
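Fundamental frequency and pitch range are straightforward to estimate. A sketch using librosa's pYIN tracker (an assumed choice; the audio file is hypothetical):

```python
# Sketch: median F0 and pitch range from a recording.
import numpy as np
import librosa

y, sr = librosa.load("speaker.wav", sr=16000)
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
f0_voiced = f0[voiced_flag]  # keep only voiced frames
print("median F0 (Hz): ", np.median(f0_voiced))
print("pitch range (Hz):", np.max(f0_voiced) - np.min(f0_voiced))
```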
Thesis
Full-text available
Speech recognition has been a growing field for the last five decades, and many advancements have led to applications such as speech-to-text, which make it possible to transcribe speech audio into text. Much work is available for English, Arabic and Cantonese. However, Urdu is a low-resource language in the field of ASR, even though it is the world's 11th most widely spoken language, with 232 million speakers worldwide. In our research we found no applicable models that could be readily deployed for speech-to-text in a noisy telephonic scenario. Apart from that, we faced a code-switching problem: in normal telephonic or call-center conversations in Urdu, people tend to spontaneously use words from other languages, since Pakistan is a multicultural society. Hence, we propose an implementation of an automatic speech recognition (speech-to-text) system for a noisy call-center environment with little labelled training data, using a hybrid HMM-DNN approach in a resource-constrained setting in terms of time, budget, computational power, human resources, etc. We had access to a large amount of unlabelled call-center audio, thanks to CPLC (a semi-government law enforcement agency), some of which was labelled manually. We further integrated various open-source datasets to include more variety. The data comprised a mix of noisy and clean audio as well as single utterances and long sentences (1-20 second audios), and was split into 6.5 hours of training and 3.5 hours of test data. The language model was developed from the training data, and for acoustic modelling we used HMMs (monophone and triphone), on top of which we trained a neural-network model using chain CNN-TDNN, achieving up to 5.2% WER on noisy and clean data as well as on single-word through spontaneous speech.
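One step the thesis mentions, building a language model from the training transcripts, can be sketched with a word n-gram model. nltk is used here as an illustrative stand-in (an ARPA-format model built with Kaldi/SRILM tooling would be more typical for a hybrid HMM-DNN pipeline); the code-switched transcripts are made up.

```python
# Sketch: trigram language model over call-center transcripts.
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

train_transcripts = [
    "ap ka number kya hai".split(),
    "complaint register karni hai".split(),  # code-switched Urdu/English
    "ap ka address kya hai".split(),
]

order = 3  # trigram
train_data, vocab = padded_everygram_pipeline(order, train_transcripts)
lm = MLE(order)
lm.fit(train_data, vocab)

# Probability of "number" following the context "ap ka".
print(lm.score("number", ["ap", "ka"]))  # -> 0.5 on this toy data
```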
... As a non-intrusive, automatic, more scalable, and less costly alternative to other screening techniques based on brain scans or blood tests, there has been increasing interest in developing speech-based AD diagnosis systems, in particular during the recent ADReSS challenge [14, 15]. For these systems, linguistic features extracted from elderly speech transcripts play a key role [6, 12, 16-30]. To this end, accurate recognition of elderly speech recorded during neurocognitive impairment assessment interviews is crucial. ...
... Alzheimer's disease (AD), the most common form of dementia, often found in aged people, is characterized by progressive degradation of memory, cognition, and motor skills, and consequently a decline in the speech and language skills of patients [1, 2]. Currently, there is no effective cure for AD [3], but a timely intervention approach can delay its progression and reduce the negative physical and mental impact on patients [4]. ...
... Compared with ASR performance measured on non-aged, healthy speech [41, 42], significantly higher speech recognition error rates are often found on elderly speech data [2, 15, 22, 23, 31, 43-45]. In order to mitigate the impact of possible ASR transcript errors on the downstream AD detection task, ASR system combination approaches are also investigated to account for the uncertainty over the quality of the outputs produced by the hybrid CNN-TDNN and end-to-end Conformer ASR systems considered in this paper. ...
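A toy sketch of the system-combination idea: vote over the word hypotheses of multiple recognizers. Real combination methods (e.g., ROVER) first align the hypotheses; equal-length outputs are assumed here purely for illustration, and the sentences are made up.

```python
# Sketch: word-level majority vote across ASR system outputs.
from collections import Counter

hyps = [
    "the patient walked to the park".split(),  # e.g., CNN-TDNN output
    "a patient walked to the park".split(),    # e.g., Conformer output
    "the patient walks to the park".split(),   # e.g., rescored output
]

combined = [Counter(words).most_common(1)[0][0] for words in zip(*hyps)]
print(" ".join(combined))  # -> "the patient walked to the park"
```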
Preprint
Full-text available
Early diagnosis of Alzheimer's disease (AD) is crucial in facilitating preventive care and delaying progression. Speech-based automatic AD screening systems provide a non-intrusive and more scalable alternative to other clinical screening techniques. Scarcity of such specialist data leads to uncertainty in both model selection and feature learning when developing such systems. To this end, this paper investigates the use of feature and model combination approaches to improve the robustness of domain fine-tuning of BERT and RoBERTa pre-trained text encoders on limited data, before the resulting embedding features are fed into an ensemble of backend classifiers to produce the final AD detection decision via majority voting. Experiments conducted on the ADReSS20 Challenge dataset suggest that consistent performance improvements were obtained using model and feature combination in system development. State-of-the-art AD detection accuracies of 91.67% and 93.75% were obtained using manual and ASR speech transcripts respectively on the ADReSS20 test set consisting of 48 elderly speakers.
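The majority-voting ensemble of backend classifiers this abstract describes can be sketched with scikit-learn's VotingClassifier. The classifier choices and the random embedding data are assumptions for illustration, not the paper's configuration.

```python
# Sketch: hard-voting ensemble over text-encoder embedding features.
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(108, 768))   # stand-in for BERT/RoBERTa embeddings
y = rng.integers(0, 2, size=108)  # 1 = AD, 0 = control (synthetic labels)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC()),
        ("tree", DecisionTreeClassifier()),
    ],
    voting="hard",  # majority vote over predicted class labels
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```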