Fig. 2: Structure of the Convolutional Neural Network applied for mVAD.

Source publication
Conference Paper
Full-text available
This paper focuses on Voice Activity Detectors (VAD) for multi-room domestic scenarios based on deep neural network architectures. Interesting advancements are observed with respect to a previous work. A comparative and extensive analysis is led among four different neural networks (NN). In particular, we exploit Deep Belief Network (DBN), Multi-L...

Contexts in source publication

Context 1
... to introduce robustness against translations of the input patterns. Finally, at the top of the network, a layer of neurons is applied. This layer does not differ from an MLP layer, being composed of a set of activations and fully connected to the previous layer. For clarity, we will refer to the units contained in this layer as Hidden Nodes (HN). Fig. 2 shows the structure of the CNN applied to the mVAD ...
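As a rough illustration of such a topology, a minimal PyTorch sketch of a CNN for mVAD follows: convolution and pooling stages topped by a fully connected layer of Hidden Nodes. All layer counts and sizes are illustrative assumptions, not the values used in the paper.

```python
import torch
import torch.nn as nn

class MVADCNN(nn.Module):
    """Sketch of a CNN for multi-room VAD: conv/pool feature stages
    followed by a fully connected layer of Hidden Nodes (HN)."""
    def __init__(self, n_rooms=2, n_feats=40, n_frames=11, n_hidden=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # local feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),   # pooling adds robustness to input translations
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # input height is n_rooms * n_feats; two 2x poolings divide each dim by 4
        flat = 32 * (n_rooms * n_feats // 4) * (n_frames // 4)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, n_hidden),     # MLP-like layer of Hidden Nodes
            nn.ReLU(),
            nn.Linear(n_hidden, n_rooms),  # one speech/non-speech output per room
            nn.Sigmoid(),
        )

    def forward(self, x):  # x: (batch, 1, n_rooms * n_feats, n_frames)
        return self.classifier(self.features(x))
```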
Context 2
... exploits the temporal evolution of the signal [16]. In our case study, the time context is created by concatenating the feature vectors of a certain number of consecutive frames. This yields a 2-D matrix of feature values related to a single room. The final input matrix is then obtained by stacking the single-room matrices, as depicted in Fig. ...
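A minimal NumPy sketch of this input construction (the number of rooms, frames, and features are illustrative assumptions):

```python
import numpy as np

n_rooms, n_frames_total, n_feats = 2, 1000, 40   # illustrative sizes
context = 11                                      # frames of temporal context

# per-room feature matrices: (n_frames_total, n_feats), e.g. pitch + MFCCs
rooms = [np.random.randn(n_frames_total, n_feats) for _ in range(n_rooms)]

def input_matrix(rooms, t, context):
    """Build the CNN input for frame t: concatenate `context` consecutive
    feature vectors per room (a 2-D matrix), then stack the room matrices."""
    per_room = [r[t:t + context].T for r in rooms]   # each: (n_feats, context)
    return np.vstack(per_room)                       # (n_rooms * n_feats, context)

x = input_matrix(rooms, t=100, context=context)
print(x.shape)  # (80, 11)
```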

Citations

... When processing audio data, multiple challenges arise, one of them being the diversity of information present in the audio signal. In the literature, segmentation tasks are often addressed by separate bi-directional recurrent or convolutional models [6,7], or Temporal Convolutional Networks (TCN) [8][9][10], trained on different datasets, thus increasing computational costs and limiting usage to specific datasets. Additionally, an OSD convolutional model [11] approaches this problem from a multiclass perspective, while a modified version of the end-to-end diarization (EEND) approach [12] is based on the multilabel paradigm. ...
Preprint
Full-text available
Audio segmentation is a key task for many speech technologies, most of which are based on neural networks that achieve high performance but are usually considered black boxes. However, in many domains, among which health or forensics, there is a need not only for good performance but also for explanations of the output decision. Explanations derived directly from latent representations need to satisfy "good" properties, such as informativeness, compactness, or modularity, to be interpretable. In this article, we propose an explainable-by-design audio segmentation model based on non-negative matrix factorization (NMF), which is a good candidate for the design of interpretable representations. This paper shows that our model achieves good segmentation performance, and presents deep analyses of the latent representation extracted from the non-negative matrix. The proposed approach opens new perspectives toward the evaluation of interpretable representations according to "good" properties.
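For illustration only (this is not the authors' implementation), NMF factorises a non-negative feature matrix V into a dictionary of parts W and their activations H, which can serve as an interpretable latent representation; a minimal scikit-learn sketch:

```python
import numpy as np
from sklearn.decomposition import NMF

# V: non-negative (frequency x time) representation, e.g. a magnitude spectrogram
rng = np.random.default_rng(0)
V = np.abs(rng.standard_normal((257, 400)))

# Factorise V ~= W @ H: W holds spectral templates, H their activations over time.
model = NMF(n_components=16, init="nndsvda", max_iter=400, random_state=0)
W = model.fit_transform(V)   # (257, 16) dictionary of parts
H = model.components_        # (16, 400) time activations

# The activations H form a compact latent representation; a segmentation
# classifier operating on H can then be inspected component by component.
print(W.shape, H.shape)
```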
... The research presented in [145] implements a SAD system based on a multilayer perceptron with energy efficiency as the main concern. A deep neural network (DNN) approach is used in [146] to perform SAD in a multi-room environment. In [147], new optimisation techniques based on the area under the ROC curve are explored in the framework of a deep learning SAD system. ...
Thesis
Full-text available
Advances in technology over the last decade have reshaped the way people interact with multimedia content. This has driven a significant rise in both the generation and consumption of these data in recent years. Manual analysis and annotation of this information is infeasible at the current volume, revealing the need for automatic tools that can help move from manual working pipelines to assisted or partially automatic practices. Over the last few years, most of these tools for multimedia information retrieval have been based on the deep learning paradigm. In this context, the work presented in this thesis focuses on the audio information retrieval domain. In particular, this dissertation studies the audio segmentation task, whose main goal is to provide a sequence of labels that isolates different regions in an input audio signal according to the characteristics described in a predefined set of classes, e.g., speech, music or noise. This study has mainly focused on two important topics: data availability and generalisation. For the first, part of the work presented in this thesis investigates ways to improve the performance of audio segmentation systems even when the training datasets are limited in size. Concerning generalisation, some of the experiments performed aim to train robust audio segmentation models that can work under different domain conditions. Research efforts presented in this dissertation are centred around three main areas: speech activity detection in challenging environments, multiclass audio segmentation, and AUC optimisation for audio segmentation.
... After training on the training set, we proceed to testing. 4. Calculation. ...
... The research presented in [7] implements a SAD system based on a multilayer perceptron with energy efficiency as the main concern. A deep neural network (DNN) approach is used in [8] to perform SAD in a multi-room environment. In [9], new optimisation techniques based on the area under the ROC curve are explored in the framework of a deep learning SAD system. ...
Article
Full-text available
Speech activity detection (SAD) aims to accurately classify audio fragments containing human speech. Current state-of-the-art systems for the SAD task are mainly based on deep learning solutions. These applications usually show a significant drop in performance when test data differ from training data due to the observed domain shift. Furthermore, machine learning algorithms require large amounts of labelled data, which may be hard to obtain in real applications. Considering both ideas, in this paper we evaluate three unsupervised domain adaptation techniques applied to the SAD task. A baseline system is trained on a combination of data from different domains and then adapted to a new unseen domain, namely data from Apollo space missions coming from the Fearless Steps challenge. Experimental results demonstrate that domain adaptation techniques seeking to minimise the statistical distribution shift provide the most promising results. In particular, the Deep CORAL method reports a 13% relative improvement in the original evaluation metric when compared to the unadapted baseline model. Further experiments show that the cascaded application of Deep CORAL and pseudo-labelling techniques can further improve the results, yielding a significant 24% relative improvement in the evaluation metric when compared to the baseline system.
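For illustration, Deep CORAL aligns the second-order statistics (feature covariances) of source and target domains; a minimal PyTorch sketch of the loss term follows (a sketch of the published technique, not the authors' exact code; batch and embedding sizes are assumptions):

```python
import torch

def coral_loss(source, target):
    """Deep CORAL loss: squared Frobenius distance between the feature
    covariance matrices of source and target batches, each (batch, dim)."""
    d = source.size(1)
    def cov(x):
        xm = x - x.mean(dim=0, keepdim=True)
        return (xm.t() @ xm) / (x.size(0) - 1)
    return ((cov(source) - cov(target)) ** 2).sum() / (4.0 * d * d)

src = torch.randn(32, 128)   # embeddings from the labelled source domain
tgt = torch.randn(32, 128)   # embeddings from the unlabelled target domain
loss = coral_loss(src, tgt)  # added to the supervised SAD loss during training
print(loss.item())
```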
... Traditional VAD algorithms were developed to detect speech in close-talk telephony with relatively high SNR, which however decreases with increased speaker distance, reverberation, and the presence of external sounds [5]. Deep neural networks (DNNs) have achieved state-of-the-art VAD accuracy in recent years under challenging conditions [7]-[18], and have also been applied to online processing using approaches based on fully connected DNNs [11], recurrent neural networks (RNNs) [13], and convolutional neural networks (CNNs) [7]. The specific task of own voice detection (OVD) using a wearable device has received little attention in recent DNN-based VAD work, apart from [17], where OVD is used for offline keyword detection in a hearing aid device. ...
... The authors of [18] consider the task of multi-room VAD, in which different DNN topologies (deep belief network (DBN) [50], DNN, bidirectional LSTM (BLSTM), and CNN) were compared by feeding them concatenated features (e.g. pitch, MFCCs) extracted from microphones located in the two rooms in which speech activity is to be detected. ...
Article
Full-text available
Voice activity detection (VAD) aims to detect the presence of speech in a given input signal and is often the first step in voice-based applications such as speech communication systems. In the context of personal devices, own voice detection (OVD) is a sub-task of VAD, since it targets speech detection of the person wearing the device while ignoring other speakers in the presence of interference signals. This article first summarizes recent single- and multi-microphone, multi-sensor, and hearing-aid-related VAD techniques. Then, a wearable in-ear device equipped with multiple microphones and an accelerometer is investigated for the OVD task using a neural network with input embedding and long short-term memory (LSTM) layers. The device picks up the user's speech signal through the air as well as vibrations through the body. However, besides external sounds, the device is sensitive to the user's own non-speech vocal noises (e.g. coughing, yawning, etc.) and movement noise caused by physical activities. A signal mixing model is proposed to produce databases of noisy observations used for training and testing the frame-by-frame OVD method. The best model's performance is further studied in the presence of different recorded interference. An ablation study reports the model's performance on subsets of the sensors. The results show that the OVD approach is robust to both user motion and user-generated non-speech vocal sounds in the presence of loud external interference. The approach is suitable for real-time operation and achieves 90-96% OVD accuracy in challenging use scenarios with a short 10 ms processing frame length.
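A minimal PyTorch sketch of such a topology (an input embedding followed by LSTM layers over per-frame multi-sensor features; the feature and layer sizes are assumptions, not the paper's values):

```python
import torch
import torch.nn as nn

class OVDNet(nn.Module):
    """Sketch: per-frame multi-sensor features (microphones + accelerometer)
    pass through an input embedding, LSTM layers, and a frame-wise output."""
    def __init__(self, n_in=120, d_embed=64, d_lstm=64):
        super().__init__()
        self.embed = nn.Linear(n_in, d_embed)            # input embedding
        self.lstm = nn.LSTM(d_embed, d_lstm, num_layers=2, batch_first=True)
        self.out = nn.Linear(d_lstm, 1)                  # own-voice probability

    def forward(self, x):                                # x: (batch, frames, n_in)
        h, _ = self.lstm(torch.relu(self.embed(x)))
        return torch.sigmoid(self.out(h)).squeeze(-1)    # (batch, frames)

net = OVDNet()
frames = torch.randn(4, 100, 120)   # 4 sequences of 100 frames
print(net(frames).shape)            # torch.Size([4, 100])
```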
... The research presented in [7] implements a SAD system based on a multilayer perceptron with energy efficiency as the main concern. A deep neural network approach is used in [8] to perform SAD in a multi-room environment. In [9], new optimisation techniques based on the area under the ROC curve are explored in the framework of a deep learning SAD system. ...
... Dealing with increasing numbers of features becomes more difficult with human-crafted models, often resulting in diminishing performance gains. Therefore, data-driven approaches, especially neural networks [14]- [16], are an attractive choice and have shown substantial performance boosts [17]. ...
Conference Paper
Full-text available
Voice activity detection (VAD) is an often-required module in various speech processing, analysis, and classification tasks. While state-of-the-art neural-network-based VADs can achieve great results, they often exceed computational budgets and real-time operating requirements. In this work, we propose a computationally efficient real-time VAD network that achieves state-of-the-art results on several public real-recording datasets. We investigate different training targets for the VAD and show that the segmental voice-to-noise ratio (VNR) is a better and more noise-robust training target than a VAD target based on the clean speech level. We also show that multi-target training further improves performance.
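As a rough sketch, a segmental VNR target can be computed per frame from aligned clean speech and noise signals (the frame length, floor values, and function name are illustrative assumptions):

```python
import numpy as np

def segmental_vnr(speech, noise, frame_len=160, eps=1e-10):
    """Per-frame voice-to-noise ratio (dB) from aligned clean speech and noise;
    a soft, noise-robust alternative to a binary clean-speech-level VAD target."""
    n = min(len(speech), len(noise)) // frame_len * frame_len
    s = speech[:n].reshape(-1, frame_len)
    v = noise[:n].reshape(-1, frame_len)
    return 10 * np.log10((s ** 2).sum(axis=1) / ((v ** 2).sum(axis=1) + eps) + eps)

fs = 16000
speech = np.random.randn(fs)         # stand-ins for aligned clean speech and noise
noise = 0.3 * np.random.randn(fs)
vnr = segmental_vnr(speech, noise)   # one VNR value per 10 ms frame at 16 kHz
print(vnr.shape, vnr[:3])
```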
... The research presented in [7] implements a SAD system based on a multilayer perceptron with energy efficiency as the main concern. A deep neural network approach is used in [8] to perform SAD in a multi-room environment. In [9], new optimisation techniques based on the area under the ROC curve are explored in the framework of a deep learning SAD system. ...
Conference Paper
Full-text available
Speech Activity Detection (SAD) aims to correctly distinguish audio segments containing human speech. Several solutions have been successfully applied to the SAD task, with deep learning approaches being especially relevant nowadays. This paper describes a SAD solution based on Convolutional Recurrent Neural Networks (CRNN), presented as the ViVoLab submission to the 2020 Fearless Steps challenge. The dataset used comes from the audio of Apollo space missions, presenting a challenging domain with strong degradation and several transmission noises. First, we explore the performance of 1D and 2D convolutional processing stages. Then we propose a novel architecture that fuses two convolutional feature maps, combining the information captured by 1D and 2D filters. The obtained results largely outperform the baseline provided by the organisers, achieving a detection cost function (DCF) below 2% on the development set for all configurations. The best results were obtained with the proposed fusion architecture, with a DCF of 1.78% on the evaluation set, ranking fourth among all participating teams in the challenge SAD task.
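A minimal PyTorch sketch of the fusion idea (parallel 1D and 2D convolutional branches whose feature maps are concatenated before a recurrent stage; this is an illustrative reconstruction, and all sizes are assumptions):

```python
import torch
import torch.nn as nn

class FusionCRNN(nn.Module):
    """Sketch: parallel 1D (along time) and 2D (time-frequency) convolutional
    branches whose feature maps are fused and fed to a recurrent layer."""
    def __init__(self, n_feats=64, d_rnn=64):
        super().__init__()
        self.conv1d = nn.Conv1d(n_feats, 32, kernel_size=3, padding=1)
        self.conv2d = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.rnn = nn.GRU(32 + 8 * n_feats, d_rnn, batch_first=True)
        self.out = nn.Linear(d_rnn, 1)

    def forward(self, x):                                      # x: (batch, frames, n_feats)
        a = self.conv1d(x.transpose(1, 2)).transpose(1, 2)     # (B, T, 32)
        b = self.conv2d(x.unsqueeze(1))                        # (B, 8, T, F)
        b = b.permute(0, 2, 1, 3).flatten(2)                   # (B, T, 8*F)
        h, _ = self.rnn(torch.cat([a, b], dim=-1))             # fused feature maps
        return torch.sigmoid(self.out(h)).squeeze(-1)          # speech prob per frame

m = FusionCRNN()
print(m(torch.randn(2, 100, 64)).shape)   # torch.Size([2, 100])
```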
... As state transitions in the STM occur based on the VAD's decision (speech or non-speech), the performance of the VAD is crucial for EPD. Recently, deep-learning-based VADs using deep neural networks (DNN) [8][9], recurrent neural networks (RNN) [10][11][12], and convolutional neural networks (CNN) [13] have outperformed conventional VADs [14][15][16]. However, VAD errors can still occur even in clean environments and with a state-of-the-art VAD, causing undesired state transitions in the STM. ...
Preprint
A state transition model (STM) based on chunk-wise classification is proposed for end-point detection (EPD). In general, EPD is built on frame-wise voice activity detection (VAD) with an additional STM, in which state transitions are conducted based on the VAD's frame-level decision (speech or non-speech). However, VAD errors frequently occur in noisy environments, even with a state-of-the-art deep-neural-network-based VAD, causing undesired state transitions in the STM. In this work, to build a robust STM, state transitions are conducted based on chunk-wise classification, since EPD does not need to be performed at the frame level. A chunk consists of multiple frames, and it is classified as speech or non-speech by aggregating the VAD decisions over those frames, so that spurious VAD errors within a chunk can be smoothed out by the remaining correct decisions. Finally, the model was evaluated with both qualitative and quantitative measures, including phone error rate.
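A minimal NumPy sketch of this chunk-wise smoothing (the chunk size and majority-vote rule are illustrative assumptions):

```python
import numpy as np

def chunk_decisions(frame_vad, chunk_size=20):
    """Aggregate binary frame-level VAD decisions into chunk-level speech /
    non-speech labels by majority vote, smoothing isolated frame errors."""
    n = len(frame_vad) // chunk_size * chunk_size
    chunks = np.asarray(frame_vad[:n]).reshape(-1, chunk_size)
    return (chunks.mean(axis=1) >= 0.5).astype(int)   # 1 = speech chunk

# frame decisions with a few spurious errors inside a speech region
frames = np.array([1] * 18 + [0] * 2 + [0] * 20)
print(chunk_decisions(frames))   # [1 0]: the two VAD errors are smoothed out
```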