Preprint

Detecting Dysfluencies in Stuttering Therapy Using wav2vec 2.0


Abstract

Stuttering is a varied speech disorder that harms an individual's communication ability. Persons who stutter (PWS) often use speech therapy to cope with their condition. Improving speech recognition systems for people with such non-typical speech, or tracking the effectiveness of speech therapy, requires systems that can detect dysfluencies and, at the same time, the speech techniques acquired in therapy. This paper shows that fine-tuning wav2vec 2.0 for the classification of stuttering on a sizeable English corpus containing stuttered speech, in conjunction with multi-task learning, boosts the effectiveness of the general-purpose wav2vec 2.0 features for detecting stuttering in speech, both within and across languages. We evaluate our method on FluencyBank and the German therapy-centric Kassel State of Fluency (KSoF) dataset by training Support Vector Machine classifiers on features extracted from the fine-tuned models for six different stuttering-related event types: blocks, prolongations, sound repetitions, word repetitions, interjections, and (specific to therapy) speech modifications. Using embeddings from the fine-tuned models leads to relative classification performance gains of up to 27% in F1-score.
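To make the evaluation metric concrete, the following is a minimal sketch of how a per-event-type F1-score and a relative gain between two systems could be computed. It is not the authors' evaluation code: the function names, the label strings, and all numbers are hypothetical and chosen only for illustration, and the sketch assumes each event type is scored as a separate binary (event present vs. absent) problem.

```python
# Hypothetical label names for the six event types named in the abstract.
EVENT_TYPES = [
    "block", "prolongation", "sound_repetition",
    "word_repetition", "interjection", "modified",
]


def binarize(clip_labels, event):
    """Turn per-clip label sets into 0/1 targets for one event type."""
    return [1 if event in labels else 0 for labels in clip_labels]


def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_gain(baseline_f1, improved_f1):
    """Relative improvement of improved_f1 over baseline_f1 (0.27 means 27%)."""
    return (improved_f1 - baseline_f1) / baseline_f1


if __name__ == "__main__":
    # Made-up annotations for four clips; a clip may carry several labels.
    clips = [{"block"}, {"interjection"}, {"block", "modified"}, {"block"}]
    y_true = binarize(clips, "block")
    y_pred = [1, 0, 0, 1]  # made-up classifier decisions
    print("F1 (block):", f1_score(y_true, y_pred))
    # Made-up scores, used only to show what a 27% relative gain means.
    print("relative gain:", relative_gain(0.50, 0.635))
```

Under this reading, a relative gain of 27% means the F1-score grows by 27% of its baseline value (for example, from a hypothetical 0.50 to 0.635), not by 27 percentage points.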
