Fig. 2: Structure of the Convolutional Neural Network applied for mVAD.

Source publication
Conference Paper
Full-text available
This paper focuses on Voice Activity Detectors (VAD) for multi-room domestic scenarios based on deep neural network architectures. Interesting advancements are observed with respect to a previous work. A comparative and extensive analysis is led among four different neural networks (NN). In particular, we exploit Deep Belief Network (DBN), Multi-L...

Contexts in source publication

Context 1
... to introduce robustness against translations of the input patterns. Finally, at the top of the network, a layer of neurons is applied. This layer does not differ from an MLP layer, being composed of a set of activations and fully connected to the previous layer. For clarity, we will refer to the units contained in this layer as Hidden Nodes (HN). Fig. 2 shows the structure of the CNN applied to the mVAD ...
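As a rough illustration of such a topology, a minimal PyTorch sketch of a CNN for mVAD follows: convolution and pooling stages topped by a fully connected layer of Hidden Nodes. All layer counts and sizes are illustrative assumptions, not the values used in the paper.

```python
import torch
import torch.nn as nn

class MVADCNN(nn.Module):
    """Sketch of a CNN for multi-room VAD: conv/pool feature stages
    followed by a fully connected layer of Hidden Nodes (HN)."""
    def __init__(self, n_rooms=2, n_feats=40, n_frames=11, n_hidden=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # local feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),   # pooling adds robustness to input translations
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # input height is n_rooms * n_feats; two 2x poolings divide each dim by 4
        flat = 32 * (n_rooms * n_feats // 4) * (n_frames // 4)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, n_hidden),     # MLP-like layer of Hidden Nodes
            nn.ReLU(),
            nn.Linear(n_hidden, n_rooms),  # one speech/non-speech output per room
            nn.Sigmoid(),
        )

    def forward(self, x):  # x: (batch, 1, n_rooms * n_feats, n_frames)
        return self.classifier(self.features(x))
```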
Context 2
... exploits the temporal evolution of the signal [16]. In our case study, the time context is created by concatenating the feature vectors of a certain number of consecutive frames. This yields a 2-D matrix of feature values related to a single room. The final input matrix is then obtained by stacking the single-room matrices, as depicted in Fig. ...
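A minimal NumPy sketch of this input construction (the number of rooms, frames, and features are illustrative assumptions):

```python
import numpy as np

n_rooms, n_frames_total, n_feats = 2, 1000, 40   # illustrative sizes
context = 11                                      # frames of temporal context

# per-room feature matrices: (n_frames_total, n_feats), e.g. pitch + MFCCs
rooms = [np.random.randn(n_frames_total, n_feats) for _ in range(n_rooms)]

def input_matrix(rooms, t, context):
    """Build the CNN input for frame t: concatenate `context` consecutive
    feature vectors per room (a 2-D matrix), then stack the room matrices."""
    per_room = [r[t:t + context].T for r in rooms]   # each: (n_feats, context)
    return np.vstack(per_room)                       # (n_rooms * n_feats, context)

x = input_matrix(rooms, t=100, context=context)
print(x.shape)  # (80, 11)
```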

Citations

... When processing audio data, multiple challenges arise, one of them being the diversity of information present in the audio signal. In the literature, segmentation tasks are often addressed by separate bi-directional recurrent or convolutional models [6,7], or Temporal Convolutional Networks (TCN) [8][9][10], trained on different datasets, thus increasing computational costs and limiting usage to specific datasets. Additionally, an OSD convolutional model [11] approaches this problem from a multiclass perspective, while a modified version of the end-to-end diarization (EEND) approach [12] is based on the multilabel paradigm. ...
Preprint
Full-text available
Audio segmentation is a key task for many speech technologies, most of which are based on neural networks that achieve high performance but are usually considered black boxes. However, in many domains, among which health or forensics, there is a need not only for good performance but also for explanations of the output decision. Explanations derived directly from latent representations need to satisfy "good" properties, such as informativeness, compactness, or modularity, to be interpretable. In this article, we propose an explainable-by-design audio segmentation model based on non-negative matrix factorization (NMF), which is a good candidate for the design of interpretable representations. This paper shows that our model achieves good segmentation performance, and presents deep analyses of the latent representation extracted from the non-negative matrix. The proposed approach opens new perspectives toward the evaluation of interpretable representations according to "good" properties.
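For illustration only (this is not the authors' implementation), NMF factorises a non-negative feature matrix V into a dictionary of parts W and their activations H, which can serve as an interpretable latent representation; a minimal scikit-learn sketch:

```python
import numpy as np
from sklearn.decomposition import NMF

# V: non-negative (frequency x time) representation, e.g. a magnitude spectrogram
rng = np.random.default_rng(0)
V = np.abs(rng.standard_normal((257, 400)))

# Factorise V ~= W @ H: W holds spectral templates, H their activations over time.
model = NMF(n_components=16, init="nndsvda", max_iter=400, random_state=0)
W = model.fit_transform(V)   # (257, 16) dictionary of parts
H = model.components_        # (16, 400) time activations

# The activations H form a compact latent representation; a segmentation
# classifier operating on H can then be inspected component by component.
print(W.shape, H.shape)
```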
... The research presented in [145] implements a SAD system based on a multilayer perceptron with energy efficiency as the main concern. A deep neural network (DNN) approach is used in [146] to perform SAD in a multi-room environment. In [147], new optimisation techniques based on the area under the ROC curve are explored in the framework of a deep learning SAD system. ...
Thesis
Full-text available
Advances in technology over the last decade have reshaped the way people interact with multimedia content. This has driven a significant rise in both the generation and consumption of these data in recent years. Manual analysis and annotation of this information is infeasible at the current volume, revealing the need for automatic tools that can help move from manual working pipelines to assisted or partially automatic practices. Over the last few years, most of these tools for multimedia information retrieval have been based on the deep learning paradigm. In this context, the work presented in this thesis focuses on the audio information retrieval domain. In particular, this dissertation studies the audio segmentation task, whose main goal is to provide a sequence of labels that isolates different regions in an input audio signal according to the characteristics described in a predefined set of classes, e.g., speech, music or noise. This study has mainly focused on two important topics: data availability and generalisation. For the first, part of the work presented in this thesis investigates ways to improve the performance of audio segmentation systems even when the training datasets are limited in size. Concerning generalisation, some of the experiments performed aim to train robust audio segmentation models that can work under different domain conditions. Research efforts presented in this dissertation are centred around three main areas: speech activity detection in challenging environments, multiclass audio segmentation, and AUC optimisation for audio segmentation.
... After training on the training set, we proceed to testing. 4. Calculation. ...
... The research presented in [7] implements a SAD system based on a multilayer perceptron with energy efficiency as the main concern. A deep neural network (DNN) approach is used in [8] to perform SAD in a multi-room environment. In [9], new optimisation techniques based on the area under the ROC curve are explored in the framework of a deep learning SAD system. ...
Article
Full-text available
Speech activity detection (SAD) aims to accurately classify audio fragments containing human speech. Current state-of-the-art systems for the SAD task are mainly based on deep learning solutions. These applications usually show a significant drop in performance when test data differ from training data due to the observed domain shift. Furthermore, machine learning algorithms require large amounts of labelled data, which may be hard to obtain in real applications. Considering both ideas, in this paper we evaluate three unsupervised domain adaptation techniques applied to the SAD task. A baseline system is trained on a combination of data from different domains and then adapted to a new unseen domain, namely data from Apollo space missions coming from the Fearless Steps challenge. Experimental results demonstrate that domain adaptation techniques seeking to minimise the statistical distribution shift provide the most promising results. In particular, the Deep CORAL method reports a 13% relative improvement in the original evaluation metric when compared to the unadapted baseline model. Further experiments show that the cascaded application of Deep CORAL and pseudo-labelling techniques can further improve the results, yielding a significant 24% relative improvement in the evaluation metric when compared to the baseline system.
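For illustration, Deep CORAL aligns the second-order statistics (feature covariances) of source and target domains; a minimal PyTorch sketch of the loss term follows (a sketch of the published technique, not the authors' exact code; batch and embedding sizes are assumptions):

```python
import torch

def coral_loss(source, target):
    """Deep CORAL loss: squared Frobenius distance between the feature
    covariance matrices of source and target batches, each (batch, dim)."""
    d = source.size(1)
    def cov(x):
        xm = x - x.mean(dim=0, keepdim=True)
        return (xm.t() @ xm) / (x.size(0) - 1)
    return ((cov(source) - cov(target)) ** 2).sum() / (4.0 * d * d)

src = torch.randn(32, 128)   # embeddings from the labelled source domain
tgt = torch.randn(32, 128)   # embeddings from the unlabelled target domain
loss = coral_loss(src, tgt)  # added to the supervised SAD loss during training
print(loss.item())
```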
... Traditional VAD algorithms were developed to detect speech in close-talk telephony with relatively high SNR, which however decreases with increased speaker distance, reverberation, and the presence of external sounds [5]. Deep neural networks (DNNs) have achieved state-of-the-art VAD accuracy in recent years under challenging conditions [7]-[18], and have also been applied to online processing using approaches based on fully connected DNNs [11], recurrent neural networks (RNNs) [13], and convolutional neural networks (CNNs) [7]. The specific task of own voice detection (OVD) using a wearable device has received little attention in recent DNN-based VAD work, apart from [17], where OVD is used for offline keyword detection in a hearing aid device. ...
... The authors of [18] consider the task of multi-room VAD, in which different DNN topologies (deep belief network (DBN) [50], DNN, bidirectional LSTM (BLSTM), and CNN) were compared by feeding them concatenated features (e.g. pitch, MFCCs) extracted from microphones located in the two rooms in which speech activity is to be detected. ...
Article
Full-text available
Voice activity detection (VAD) aims to detect the presence of speech in a given input signal and is often the first step in voice-based applications such as speech communication systems. In the context of personal devices, own voice detection (OVD) is a sub-task of VAD, since it targets speech detection of the person wearing the device while ignoring other speakers in the presence of interference signals. This article first summarizes recent single- and multi-microphone, multi-sensor, and hearing-aid-related VAD techniques. Then, a wearable in-ear device equipped with multiple microphones and an accelerometer is investigated for the OVD task using a neural network with input embedding and long short-term memory (LSTM) layers. The device picks up the user's speech signal through the air as well as vibrations through the body. However, besides external sounds, the device is sensitive to the user's own non-speech vocal noises (e.g. coughing, yawning, etc.) and movement noise caused by physical activities. A signal mixing model is proposed to produce databases of noisy observations used for training and testing the frame-by-frame OVD method. The best model's performance is further studied in the presence of different recorded interference. An ablation study reports the model's performance on subsets of the sensors. The results show that the OVD approach is robust to both user motion and user-generated non-speech vocal sounds in the presence of loud external interference. The approach is suitable for real-time operation and achieves 90-96% OVD accuracy in challenging use scenarios with a short 10 ms processing frame length.
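A minimal PyTorch sketch of such a topology (an input embedding followed by LSTM layers over per-frame multi-sensor features; the feature and layer sizes are assumptions, not the paper's values):

```python
import torch
import torch.nn as nn

class OVDNet(nn.Module):
    """Sketch: per-frame multi-sensor features (microphones + accelerometer)
    pass through an input embedding, LSTM layers, and a frame-wise output."""
    def __init__(self, n_in=120, d_embed=64, d_lstm=64):
        super().__init__()
        self.embed = nn.Linear(n_in, d_embed)            # input embedding
        self.lstm = nn.LSTM(d_embed, d_lstm, num_layers=2, batch_first=True)
        self.out = nn.Linear(d_lstm, 1)                  # own-voice probability

    def forward(self, x):                                # x: (batch, frames, n_in)
        h, _ = self.lstm(torch.relu(self.embed(x)))
        return torch.sigmoid(self.out(h)).squeeze(-1)    # (batch, frames)

net = OVDNet()
frames = torch.randn(4, 100, 120)   # 4 sequences of 100 frames
print(net(frames).shape)            # torch.Size([4, 100])
```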
... The research presented in [7] implements a SAD system based on a multilayer perceptron with energy efficiency as the main concern. A deep neural network approach is used in [8] to perform SAD in a multi-room environment. In [9], new optimisation techniques based on the area under the ROC curve are explored in the framework of a deep learning SAD system. ...
... Dealing with increasing numbers of features becomes more difficult with human-crafted models, often resulting in diminishing performance gains. Therefore, data-driven approaches, especially neural networks [14]- [16], are an attractive choice and have shown substantial performance boosts [17]. ...
Conference Paper
Full-text available
Voice activity detection (VAD) is an often-required module in various speech processing, analysis, and classification tasks. While state-of-the-art neural-network-based VADs can achieve great results, they often exceed computational budgets and real-time operating requirements. In this work, we propose a computationally efficient real-time VAD network that achieves state-of-the-art results on several public real-recording datasets. We investigate different training targets for the VAD and show that the segmental voice-to-noise ratio (VNR) is a better and more noise-robust training target than a VAD target based on the clean speech level. We also show that multi-target training further improves performance.
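As a rough sketch, a segmental VNR target can be computed per frame from aligned clean speech and noise signals (the frame length, floor values, and function name are illustrative assumptions):

```python
import numpy as np

def segmental_vnr(speech, noise, frame_len=160, eps=1e-10):
    """Per-frame voice-to-noise ratio (dB) from aligned clean speech and noise;
    a soft, noise-robust alternative to a binary clean-speech-level VAD target."""
    n = min(len(speech), len(noise)) // frame_len * frame_len
    s = speech[:n].reshape(-1, frame_len)
    v = noise[:n].reshape(-1, frame_len)
    return 10 * np.log10((s ** 2).sum(axis=1) / ((v ** 2).sum(axis=1) + eps) + eps)

fs = 16000
speech = np.random.randn(fs)         # stand-ins for aligned clean speech and noise
noise = 0.3 * np.random.randn(fs)
vnr = segmental_vnr(speech, noise)   # one VNR value per 10 ms frame at 16 kHz
print(vnr.shape, vnr[:3])
```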
... The research presented in [7] implements a SAD system based on a multilayer perceptron with energy efficiency as the main concern. A deep neural network approach is used in [8] to perform SAD in a multi-room environment. In [9], new optimisation techniques based on the area under the ROC curve are explored in the framework of a deep learning SAD system. ...
Conference Paper
Full-text available
Speech Activity Detection (SAD) aims to correctly distinguish audio segments containing human speech. Several solutions have been successfully applied to the SAD task, with deep learning approaches being especially relevant nowadays. This paper describes a SAD solution based on Convolutional Recurrent Neural Networks (CRNN), presented as the ViVoLab submission to the 2020 Fearless Steps challenge. The dataset used comes from the audio of Apollo space missions, presenting a challenging domain with strong degradation and several transmission noises. First, we explore the performance of 1D and 2D convolutional processing stages. Then we propose a novel architecture that fuses two convolutional feature maps, combining the information captured by 1D and 2D filters. The obtained results largely outperform the baseline provided by the organisers, achieving a detection cost function (DCF) below 2% on the development set for all configurations. The best results were obtained with the proposed fusion architecture, with a DCF of 1.78% on the evaluation set, ranking fourth among all participating teams in the challenge SAD task.
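A minimal PyTorch sketch of the fusion idea (parallel 1D and 2D convolutional branches whose feature maps are concatenated before a recurrent stage; this is an illustrative reconstruction, and all sizes are assumptions):

```python
import torch
import torch.nn as nn

class FusionCRNN(nn.Module):
    """Sketch: parallel 1D (along time) and 2D (time-frequency) convolutional
    branches whose feature maps are fused and fed to a recurrent layer."""
    def __init__(self, n_feats=64, d_rnn=64):
        super().__init__()
        self.conv1d = nn.Conv1d(n_feats, 32, kernel_size=3, padding=1)
        self.conv2d = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.rnn = nn.GRU(32 + 8 * n_feats, d_rnn, batch_first=True)
        self.out = nn.Linear(d_rnn, 1)

    def forward(self, x):                                      # x: (batch, frames, n_feats)
        a = self.conv1d(x.transpose(1, 2)).transpose(1, 2)     # (B, T, 32)
        b = self.conv2d(x.unsqueeze(1))                        # (B, 8, T, F)
        b = b.permute(0, 2, 1, 3).flatten(2)                   # (B, T, 8*F)
        h, _ = self.rnn(torch.cat([a, b], dim=-1))             # fused feature maps
        return torch.sigmoid(self.out(h)).squeeze(-1)          # speech prob per frame

m = FusionCRNN()
print(m(torch.randn(2, 100, 64)).shape)   # torch.Size([2, 100])
```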
... As state transitions in the STM occur based on the VAD's decision (speech or non-speech), the performance of the VAD is crucial for EPD. Recently, deep-learning-based VADs using deep neural networks (DNN) [8][9], recurrent neural networks (RNN) [10][11][12], and convolutional neural networks (CNN) [13] have outperformed conventional VADs [14][15][16]. However, VAD errors can still occur even in clean environments and with a state-of-the-art VAD, causing undesired state transitions in the STM. ...
Preprint
A state transition model (STM) based on chunk-wise classification is proposed for end-point detection (EPD). In general, EPD is built on frame-wise voice activity detection (VAD) with an additional STM, in which state transitions are conducted based on the VAD's frame-level decision (speech or non-speech). However, VAD errors frequently occur in noisy environments, even with a state-of-the-art deep-neural-network-based VAD, causing undesired state transitions in the STM. In this work, to build a robust STM, state transitions are conducted based on chunk-wise classification, since EPD does not need to be performed at the frame level. A chunk consists of multiple frames, and it is classified as speech or non-speech by aggregating the VAD decisions over those frames, so that spurious VAD errors within a chunk can be smoothed out by the remaining correct decisions. Finally, the model was evaluated with both qualitative and quantitative measures, including phone error rate.
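A minimal NumPy sketch of this chunk-wise smoothing (the chunk size and majority-vote rule are illustrative assumptions):

```python
import numpy as np

def chunk_decisions(frame_vad, chunk_size=20):
    """Aggregate binary frame-level VAD decisions into chunk-level speech /
    non-speech labels by majority vote, smoothing isolated frame errors."""
    n = len(frame_vad) // chunk_size * chunk_size
    chunks = np.asarray(frame_vad[:n]).reshape(-1, chunk_size)
    return (chunks.mean(axis=1) >= 0.5).astype(int)   # 1 = speech chunk

# frame decisions with a few spurious errors inside a speech region
frames = np.array([1] * 18 + [0] * 2 + [0] * 20)
print(chunk_decisions(frames))   # [1 0]: the two VAD errors are smoothed out
```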