Dominik Wagner's research while affiliated with Technische Hochschule Nürnberg Georg Simon Ohm and other places

Publications (26)

Preprint
With the advancement of multimedia technologies, news documents and user-generated content are often represented as multiple modalities, making Multimedia Event Extraction (MEE) an increasingly important challenge. However, recent MEE methods employ weak alignment strategies and data augmentation with simple classification models, which ignore the...
Preprint
In this work, we optimize speculative sampling for parallel hardware accelerators to improve sampling speed. We notice that substantial portions of the intermediate matrices necessary for speculative sampling can be computed concurrently. This allows us to distribute the workload across multiple GPU threads, enabling simultaneous operations on matr...
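The token-acceptance test at the core of speculative sampling can be sketched in a few lines; the function name, tensor shapes, and variable names below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def accept_draft_tokens(p_target, q_draft, draft_tokens, rng):
    """Standard speculative-sampling acceptance test (illustrative sketch).

    p_target: (k, V) target-model probabilities at each draft position
    q_draft:  (k, V) draft-model probabilities at the same positions
    draft_tokens: (k,) token ids proposed by the draft model
    Returns the number of draft tokens accepted (length of the kept prefix).
    """
    k = len(draft_tokens)
    # The acceptance ratios for all k positions are independent, so they
    # can be computed concurrently -- the kind of intermediate computation
    # the paper distributes across GPU threads.
    p = p_target[np.arange(k), draft_tokens]
    q = q_draft[np.arange(k), draft_tokens]
    ratios = np.minimum(1.0, p / q)
    u = rng.random(k)
    rejected = np.nonzero(u > ratios)[0]
    return k if rejected.size == 0 else int(rejected[0])
```

When draft and target distributions agree, every ratio is 1 and all k tokens are accepted; the first rejection truncates the accepted prefix.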
Preprint
This paper explores the improvement of post-training quantization (PTQ) after knowledge distillation in the Whisper speech foundation model family. We address the challenge of outliers in weights and activation tensors, known to impede quantization quality in transformer-based language and vision models. Extending this observation to Whisper, we de...
Preprint
Accurately detecting dysfluencies in spoken language can help to improve the performance of automatic speech and language processing components and support the development of more inclusive speech and language technologies. Inspired by the recent trend towards the deployment of large language models (LLMs) as universal learners and processors of no...
Preprint
Cycle-consistent generative adversarial networks have been widely used in non-parallel voice conversion (VC). Their ability to learn mappings between source and target features without relying on parallel training data eliminates the need for temporal alignments. However, most methods decouple the conversion of acoustic features from synthesizing t...
Preprint
Most stuttering detection and classification research has viewed stuttering as a multi-class classification problem or a binary detection task for each dysfluency type; however, this does not match the nature of stuttering, in which one dysfluency seldom comes alone but rather co-occurs with others. This paper explores multi-language and cross-corp...
Article
Vocal fatigue refers to the feeling of tiredness and weakness of voice due to extended utilization. This paper investigates the effectiveness of neural embeddings for the detection of vocal fatigue. We compare x-vectors, ECAPA-TDNN, and wav2vec 2.0 embeddings on a corpus of academic spoken English. Low-dimensional mappings of the data reveal that n...
Preprint
This work adapts two recent architectures of generative models and evaluates their effectiveness for the conversion of whispered speech to normal speech. We incorporate the normal target speech into the training criterion of vector-quantized variational autoencoders (VQ-VAEs) and MelGANs, thereby conditioning the systems to recover voiced speech fr...
Preprint
We analyze the impact of speaker adaptation in end-to-end architectures based on transformers and wav2vec 2.0 under different noise conditions. We demonstrate that the proven method of concatenating speaker vectors to the acoustic features and supplying them as an auxiliary model input remains a viable option to increase the robustness of end-to-en...
Preprint
Current findings show that pre-trained wav2vec 2.0 models can be successfully used as feature extractors to discriminate on speaker-based tasks. We demonstrate that latent representations extracted at different layers of a pre-trained wav2vec 2.0 system can be effectively used for binary classification of various types of pathologic speech. We exam...
Preprint
Specially adapted speech recognition models are necessary to handle stuttered speech. For these to be used in a targeted manner, stuttered speech must be reliably detected. Recent works have treated stuttering as a multi-class classification problem or viewed detecting each dysfluency type as an isolated task, which does not capture the nature of st...

Preprint
The detection of pathologies from speech features is usually defined as a binary classification task with one class representing a specific pathology and the other class representing healthy speech. In this work, we train neural networks, large margin classifiers, and tree boosting machines to distinguish between four different pathologies: Parkins...
Chapter
This paper empirically investigates the influence of different data splits and splitting strategies on the performance of dysfluency detection systems. For this, we perform experiments using wav2vec 2.0 models with a classification head as well as support vector machines (SVM) in conjunction with the features extracted from the wav2vec 2.0 model to...
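The SVM-on-embeddings setup described above can be sketched as follows. This is a minimal illustration using synthetic feature vectors in place of real wav2vec 2.0 embeddings; the function name and parameters are assumptions for the sketch, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.svm import SVC

def classify_dysfluency(features, labels, test_features):
    """Train an SVM on utterance-level embeddings (illustrative sketch).

    In a setup like the paper's, `features` would be pooled wav2vec 2.0
    hidden states, one vector per utterance; here any (n_samples, dim)
    array works.
    """
    clf = SVC(kernel="rbf", C=1.0)
    clf.fit(features, labels)
    return clf.predict(test_features)
```

With well-separated synthetic clusters standing in for dysfluent vs. fluent embeddings, the classifier recovers the two classes; the point of the sketch is only the feature-extractor-plus-SVM structure, not any claimed accuracy.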
Preprint
This work aims to automatically evaluate whether the language development of children is age-appropriate. Validated speech and language tests are used for this purpose to test the auditory memory. In this work, the task is to determine whether spoken nonwords have been uttered correctly. We compare different approaches that are motivated to model s...
Preprint
This paper empirically investigates the influence of different data splits and splitting strategies on the performance of dysfluency detection systems. For this, we perform experiments using wav2vec 2.0 models with a classification head as well as support vector machines (SVM) in conjunction with the features extracted from the wav2vec 2.0 model to...
Preprint
Vocal fatigue refers to the feeling of tiredness and weakness of voice due to extended utilization. This paper investigates the effectiveness of neural embeddings for the detection of vocal fatigue. We compare x-vectors, ECAPA-TDNN, and wav2vec 2.0 embeddings on a corpus of academic spoken English. Low-dimensional mappings of the data reveal that n...
Preprint
Stuttering is a varied speech disorder that harms an individual's communication ability. Persons who stutter (PWS) often use speech therapy to cope with their condition. Improving speech recognition systems for people with such non-typical speech or tracking the effectiveness of speech therapy would require systems that can detect dysfluencies whil...

Citations

... Most studies employ spectral features like MFCCs or coefficients derived from linear predictive coding methods [7,8]. Recent work has explored neural representations obtained from acoustic encoder models [9,10,11,12], in particular wav2vec 2.0 [13]. These representations are then used as input features to various classifiers ranging from support vector machines to neural networks. ...
... It has been shown that speech representations capture human perceptual understanding [11] and preserve consistent attributes within the speech, such as speaker, language, emotion, and age. As speech contains rich information about the condition of several important organs, the rise of these models has prompted several works exploring and evaluating their potential for identifying disease [12,13]. However, deep learning models lack interpretability, which hinders their application in healthcare settings. ...
... The study of speech biomarkers in mental health holds great potential, offering a non-invasive and easily accessible avenue to capture significant motor, cognitive, and behavioral changes due to mental health disorders such as depression and anxiety [10-13]. Clinical evidence and research studies have increasingly linked specific automatically extracted speech features, such as prosody, articulation, and fluency, with various mental health conditions and states, including depression [10,14], anxiety [15], suicide risk [16], fatigue [17,18], and sleep deprivation [19]. The complexity of human speech extends beyond the intricate motor coordination involved. ...
... However, these approaches model cepstral and F0 features with separate GANs and require additional vocoders for speech synthesis. Other methods for whispered speech conversion use the attention mechanism [23] to learn time alignments between utterance pairs and train models on data that was aligned via dynamic time warping (DTW) prior to training [24,25]. Some studies avoid time alignment by generating artificial whispers from voiced input speech [26] or by representing the speech content in latent space before decoding it back to a mel-spectrogram representation [27]. ...
... Examined metadata includes age, gender, recording location, accent, and spoken text. Furthermore, based on previous works [9,20,21], we assume that the lower encoder blocks contain low-level features, while higher layers contain more content and phonetic information. For this reason, we aggregate three layers at a time, calculating the average over the lower layers (1-3), the middle layers (4-6 and 7-9), and the higher layers (10-12). ...
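The layer-group averaging described in this excerpt can be sketched as follows; the function name and array shapes are illustrative assumptions.

```python
import numpy as np

def aggregate_layer_groups(hidden_states, group_size=3):
    """Average encoder layers in consecutive groups of `group_size`.

    hidden_states: (num_layers, time, dim) stacked encoder outputs,
    e.g. layers 1-12 of a wav2vec 2.0 model. Returns an array of shape
    (num_layers // group_size, time, dim), i.e. for 12 layers the
    averages over groups (1-3), (4-6), (7-9), and (10-12).
    """
    L, T, D = hidden_states.shape
    n_groups = L // group_size
    # Drop any trailing layers that do not fill a complete group.
    trimmed = hidden_states[: n_groups * group_size]
    return trimmed.reshape(n_groups, group_size, T, D).mean(axis=1)
```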