Home
Concordia University Montreal
Department of Computer Science and Software Engineering
Mirco Ravanelli

Mirco Ravanelli
Concordia University Montreal · Department of Computer Science and Software Engineering

PhD

Deep learning, conversational AI, and EEG signal processing. Creator of SpeechBrain (https://speechbrain.github.io/)

About

112

Publications

46,491

Reads

4,354

Citations

I'm an Assistant Professor at Concordia University (Gina Cody School of Engineering and Computer Science) working on deep learning for sequence processing, with a focus on Conversational AI. I'm also adjunct professor at the Université de Montréal (DIRO - Département d'informatique et de recherche opérationnelle) and Associate Member at Mila - Quebec AI Institute. I'm the author of more than 50 papers on deep learning, conversational AI, and EEG signal processing.

Skills and Expertise

Signal Processing

Digital Signal Processing

Machine Learning

Pattern Recognition

FFT

Time-Frequency Analysis

Speech Processing

Feature Extraction

Speech Science

Advanced Machine Learning

January 2018 - present

Université de Montréal

Montreal Institute for Learning Algorithms (MILA)
Canada

Position

PostDoc Position

Description

PostDoc at MILA working on deep learning for speech recognition

April 2016 - October 2016

Université de Montréal

Canada

Position

PhD Student

Description

Deep Learning for distant speech recognition

January 2013 - May 2013

University of California, Berkeley

Position

Visiting Researcher

September 2005 - February 2011

Università degli Studi di Trento

Field of study

Publications

Key Features of the Discrete Audio Encoders. #Params is computed for...

Benchmarking results for discriminative tasks.

(left) Evaluating various discrete decoders on the speech re-synthesis...

Features of the Considered Discrete Audio Encoders.

DASB -- Discrete Audio and Speech Benchmark

Preprint

Full-text available

Jun 2024

Discrete audio tokens have recently gained considerable attention for their potential to connect audio and language processing, enabling the creation of modern multimodal large language models. Ideal audio tokens must effectively preserve phonetic and semantic content along with paralinguistic information, speaker identity, and other details. While...

Figure 1: The proposed method for audio token extraction from SSL...

Out-of-domain and in-domain performance of discrete HuBERT and WavLM...

How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

Preprint

Full-text available

Jun 2024

Discrete audio tokens have recently gained attention for their potential to bridge the gap between audio and language processing. Ideal audio tokens must preserve content, paralinguistic elements, speaker identity, and many other audio details. Current audio tokenization methods fall into two categories: Semantic tokens, acquired through quantizati...

Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice

Preprint

Jun 2024

In this paper, we propose Phoneme Discretized Saliency Maps (PDSM), a discretization algorithm for saliency maps that takes advantage of phoneme boundaries for explainable detection of AI-generated voice. We experimentally show with two different Text-to-Speech systems (i.e., Tacotron2 and Fastspeech2) that the proposed algorithm produces saliency...

Listenable Maps for Zero-Shot Audio Classifiers

Preprint

May 2024

Interpreting the decisions of deep learning models, including audio classifiers, is crucial for ensuring the transparency and trustworthiness of this technology. In this paper, we introduce LMAC-ZS (Listenable Maps for Audio Classifiers in the Zero-Shot context), which, to the best of our knowledge, is the first decoder-based post-hoc interpretatio...

Resource-Efficient Separation Transformer

Conference Paper

Apr 2024

Speech Emotion Diarization: Which Emotion Appears When?

Conference Paper

Dec 2023

TorchAudio 2.1: Advancing Speech Recognition, Self-Supervised Learning, and Audio Processing Components for Pytorch

Conference Paper

Dec 2023

Rescuespeech: A German Corpus for Speech Recognition in Search and Rescue Domain

Conference Paper

Dec 2023

CL-MASR: A Continual Learning Benchmark for Multilingual ASR

Preprint

Full-text available

Oct 2023

p>This paper introduces Continual Learning for Multilingual ASR (CL-MASR), a benchmark for continual learning applied to multilingual ASR. CL-MASR offers a curated selection of medium/low-resource languages, a modular and flexible platform for executing and evaluating various CL methods on top of existing large-scale pretrained multilingual ASR mod...

CL-MASR: A Continual Learning Benchmark for Multilingual ASR

Preprint

Full-text available

Oct 2023

Speech Self-Supervised Representations Benchmarking: a Case for Larger Probing Heads

Preprint

Aug 2023

Self-supervised learning (SSL) leverages large datasets of unlabeled speech to reach impressive performance with reduced amounts of annotated data. The high number of proposed approaches fostered the emergence of comprehensive benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. How...

Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?

Conference Paper

Aug 2023

Speech Emotion Diarization: Which Emotion Appears When?

Preprint

Full-text available

Jun 2023

Speech Emotion Recognition (SER) typically relies on utterance-level solutions. However, emotions conveyed through speech should be considered as discrete speech events with definite temporal boundaries, rather than attributes of the entire utterance. To reflect the fine-grained nature of speech emotions, we propose a new task: Speech Emotion Diari...

Fine-Tuning Strategies for Faster Inference Using Speech Self-Supervised Models: A Comparative Study

Conference Paper

Jun 2023

Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?

Preprint

Jun 2023

Self-supervised learning (SSL) has recently allowed leveraging large datasets of unlabeled speech signals to reach impressive performance on speech tasks using only small amounts of annotated data. The high number of proposed approaches fostered the need and rise of extended benchmarks that evaluate their performance on a set of downstream tasks ex...

Posthoc Interpretation via Quantization

Preprint

Mar 2023

In this paper, we introduce a new approach, called "Posthoc Interpretation via Quantization (PIQ)", for interpreting decisions made by trained classifiers. Our method utilizes vector quantization to transform the representations of a classifier into a discrete, class-specific latent space. The class-specific codebooks act as a bottleneck that force...

Fine-tuning Strategies for Faster Inference using Speech Self-Supervised Models: A Comparative Study

Preprint

Mar 2023

Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings. In this context, it has been demonstrated that larger self-supervised feature extractors are crucial for achieving lower downstream ASR error rates. Thus, better performance might be sanctioned with longer infe...

Exploring Self-Attention Mechanisms for Speech Separation

Article

Jan 2023

Transformers have enabled impressive improvements in deep learning. They often outperform recurrent and convolutional models in many tasks while taking advantage of parallel processing. Recently, we proposed the SepFormer, which obtains state-of-the-art performance in speech separation with the WSJ0-2/3 Mix datasets. This paper studies in-depth Tra...

OSSEM: one-shot speaker adaptive speech enhancement using meta learning

Conference Paper

Full-text available

Sep 2022

SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation

Conference Paper

Sep 2022

SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation

Preprint

Jul 2022

End-to-end speech synthesis models directly convert the input characters into an audio representation (e.g., spectrograms). Despite their impressive performance, such models have difficulty disambiguating the pronunciations of identically spelled words. To mitigate this issue, a separate Grapheme-to-Phoneme (G2P) model can be employed to convert th...

Resource-Efficient Separation Transformer

Preprint

Jun 2022

Transformers have recently achieved state-of-the-art performance in speech separation. These models, however, are computationally-demanding and require a lot of learnable parameters. This paper explores Transformer-based speech separation with a reduced computational cost. Our main contribution is the development of the Resource-Efficient Separatio...

MetricGAN-U: Unsupervised Speech Enhancement/ Dereverberation Based Only on Noisy/ Reverberated Speech

Conference Paper

May 2022

Real-M: Towards Speech Separation on Real Mixtures

Conference Paper

May 2022

Learning Representations for New Sound Classes With Continual Self-Supervised Learning

Preprint

May 2022

In this paper, we present a self-supervised learning framework for continually learning representations for new sound classes. The proposed system relies on a continually trained neural encoder that is trained with similarity-based learning objectives without using labels. We show that representations learned with the proposed method generalize bet...

On Using Transformers for Speech-Separation

Preprint

Feb 2022

Transformers have enabled major improvements in deep learning. They often outperform recurrent and convolutional models in many tasks while taking advantage of parallel processing. Recently, we have proposed SepFormer, which uses self-attention and obtains state-of-the art results on WSJ0-2/3 Mix datasets for speech separation. In this paper, we ex...

Learning Representations for New Sound Classes With Continual Self-Supervised Learning

Article

Jan 2022

In this paper, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose adopting representation learning, where an encoder is trained using unlabeled data. This learning framework enables...

Fig. 3. The masks generated by the SSM network of speech samples from...

Fig. 4. The t-SNE analysis of (a) ECAPA-TDNN speaker embedding of all...

Evaluation results of OSSEM and other causal SE systems on the...

OSSEM: one-shot speaker adaptive speech enhancement using meta learning

Preprint

Full-text available

Nov 2021

Although deep learning (DL) has achieved notable progress in speech enhancement (SE), further research is still required for a DL-based SE system to adapt effectively and efficiently to particular speakers. In this study, we propose a novel meta-learning-based speaker-adaptive SE approach (called OSSEM) that aims to achieve SE model adaptation in a...

REAL-M: Towards Speech Separation on Real Mixtures

Preprint

Oct 2021

In recent years, deep learning based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper contributes to fill this gap in two ways. First, we release the RE...

Subjective evaluation results for speech enhancement.

Comparison of MetricGAN-U with other methods for speech dereverberation...

MetricGAN-U: Unsupervised speech enhancement/ dereverberation based only on noisy/ reverberated speech

Preprint

Full-text available

Oct 2021

Most of the deep learning-based speech enhancement models are learned in a supervised manner, which implies that pairs of noisy and clean speech are required during training. Consequently, several noisy speeches recorded in daily life cannot be used to train the model. Although certain unsupervised learning frameworks have also been proposed to sol...

MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement

Conference Paper

Full-text available

Aug 2021

The Energy and Carbon Footprint of Training End-to-End Speech Recognizers

Conference Paper

Aug 2021

ECAPA-TDNN Embeddings for Speaker Diarization

Conference Paper

Full-text available

Aug 2021

Interpretable SincNet-based Deep Learning for Emotion Recognition from EEG brain activity

Conference Paper

Full-text available

Jul 2021

Machine learning methods, such as deep learning, show promising results in the medical domain. However, the lack of interpretability of these algorithms may hinder their applicability to medical decision support systems. This paper studies an interpretable deep learning technique, called SincNet. SincNet is a convolutional neural network that effic...

Interpretable SincNet-based Deep Learning for Emotion Recognition from EEG brain activity

Preprint

Full-text available

Jul 2021

SpeechBrain: A General-Purpose Speech Toolkit

Preprint

Full-text available

Jun 2021

SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally co...

Attention Is All You Need In Speech Separation

Conference Paper

Full-text available

Jun 2021

Figure 1: A SimpleMath command and its label.

Timers and Such statistics. ( * Speaker counts are ap- proximate; see...

Timers and Such: A Practical Benchmark for Spoken Language Understanding with Numbers

Preprint

Full-text available

Apr 2021

This paper introduces Timers and Such, a new open source dataset of spoken English commands for common voice control use cases involving numbers. We describe the gap in existing spoken language understanding datasets that Timers and Such fills, the design and creation of the dataset, and experiments with a number of ASR-based and end-to-end baselin...

Figure 1: Block diagram of the ECAPA-TDNN model [9].

DERs achieved by the proposed system on distant and close talking audio...

ECAPA-TDNN Embeddings for Speaker Diarization

Preprint

Full-text available

Apr 2021

Learning robust speaker embeddings is a crucial step in speaker diarization. Deep neural networks can accurately capture speaker discriminative characteristics and popular deep embeddings such as x-vectors are nowadays a fundamental component of modern diarization systems. Recently, some improvements over the standard TDNN architecture used for x-v...

Public Repository

Experiment Findings

Mar 2021

Mirco Ravanelli

https://github.com/speechbrain/speechbrain

Comparison on CATER Object Tracking. Here, we compare the Top-1 and...

Transformers with Competitive Ensembles of Independent Mechanisms

Preprint

Full-text available

Feb 2021

An important development in deep learning from the earliest MLPs has been a move towards architectures with structural inductive biases which enable the model to keep distinct sources of information and routes of processing well-separated. This structure is linked to the notion of independent mechanisms from the causality literature, in which a mec...

Quaternion Neural Networks for Multi-Channel Distant Speech Recognition

Conference Paper

Full-text available

Oct 2020

Despite the significant progress in automatic speech recognition (ASR), distant ASR remains challenging due to noise and reverberation. A common approach to mitigate this issue consists of equipping the recording devices with multiple microphones that capture the acoustic scene from different perspectives. These multi-channel audio recordings conta...

Attention is All You Need in Speech Separation

Preprint

Oct 2020

Recurrent Neural Networks (RNNs) have long been the dominant architecture in sequence-to-sequence learning. RNNs, however, are inherently sequential models that do not allow parallelization of their computations. Transformers are emerging as a natural alternative to standard RNNs, replacing recurrent computations with a multi-head attention mechani...

BIRD: Big Impulse Response Dataset

Preprint

Oct 2020

This paper introduces BIRD, the Big Impulse Response Dataset. This open dataset consists of 100,000 multichannel room impulse responses (RIRs) generated from simulations using the Image Method, making it the largest multichannel open dataset currently available. These RIRs can be used toperform efficient online data augmentation for scenarios that...

Quaternion Neural Networks for Multi-channel Distant Speech Recognition

Preprint

Full-text available

May 2020

Using Speech Synthesis to Train End-To-End Spoken Language Understanding Models

Conference Paper

May 2020

Multi-Task Self-Supervised Learning for Robust Speech Recognition

Conference Paper

Full-text available

Jan 2020

Despite the growing interest in unsupervised learning, extracting meaningful knowledge from unlabelled audio remains an open challenge. To take a step in this direction, we recently proposed a problem-agnostic speech encoder (PASE), that combines a con-volutional encoder followed by multiple neural networks, called workers, tasked to solve self-sup...

Phone error rate (PER) obtained on the TIMIT and DIRHA corpora...

Multi-task self-supervised learning for Robust Speech Recognition

Preprint

Full-text available

Jan 2020

Despite the growing interest in unsupervised learning, extracting meaningful knowledge from unlabelled audio remains an open challenge. To take a step in this direction, we recently proposed a problem-agnostic speech encoder (PASE), that combines a convolutional encoder followed by multiple neural networks, called workers, tasked to solve self-supe...

Using Cooperative Ad-hoc Microphone Arrays for ASR

Technical Report

Dec 2019

This report summarizes activities and achievements obtained during and after JSALT 2019 workshop on Using Cooperative Ad-hoc Microphone Arrays for ASR. Besides its contents, relevant contributions are given by the attached slides, used during the closing ceremony, and by recent paper submissions to ICASSP 2020. The report is organized in six sectio...

Using Speech Synthesis to Train End-to-End Spoken Language Understanding Models

Preprint

Oct 2019

End-to-end models are an attractive new approach to spoken language understanding (SLU) in which the meaning of an utterance is inferred directly from the raw audio without employing the standard pipeline composed of a separately trained speech recognizer and natural language understanding module. The downside of end-to-end SLU is that in-domain sp...

Learning Speaker Representations with Mutual Information

Conference Paper

Sep 2019

Learning Problem-Agnostic Speech Representations from Multiple Self-Supervised Tasks

Conference Paper

Sep 2019

Speech Model Pre-Training for End-to-End Spoken Language Understanding

Conference Paper

Sep 2019

Quaternion Recurrent Neural Networks

Preprint

Full-text available

May 2019

Recurrent neural networks (RNNs) are powerful architectures to model sequential data, due to their capability to learn short and long-term dependencies between the basic elements of a sequence. Nonetheless, popular tasks such as speech or images recognition, involve multi-dimensional input features that are characterized by strong internal dependen...

The Pytorch-kaldi Speech Recognition Toolkit

Conference Paper

May 2019

Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks

Preprint

Full-text available

Apr 2019

Learning good representations without supervision is still an open issue in machine learning, and is particularly challenging for speech signals, which are often characterized by long sequences with a complex hierarchical structure. Some recent works, however, have shown that it is possible to derive useful speech representations by employing a sel...

Speech Model Pre-training for End-to-End Spoken Language Understanding

Preprint

Full-text available

Apr 2019

Whereas conventional spoken language understanding (SLU) systems map speech to text, and then text to intent, end-to-end SLU systems map speech directly to intent through a single trainable model. Achieving high accuracy with these end-to-end models without a large amount of training data is difficult. We propose a method to reduce the data require...

Speech Model Pre-training for End-to-End Spoken Language Understanding

Preprint

Full-text available

Apr 2019

Word error rate (WER) obtained on the DIRHA corpus.

Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks

Preprint

Full-text available

Apr 2019

Fig. 2: Cumulative frequency response of SincNet and CNN filters on...

Speech and Speaker Recognition from Raw Waveform with SincNet

Preprint

Full-text available

Dec 2018

Deep neural networks can learn complex and abstract representations, that are progressively obtained by combining simpler ones. A recent trend in speech and speaker recognition consists in discovering these representations starting from raw audio samples directly. Differently from standard hand-crafted features such as MFCCs or FBANK, the raw wavef...

Speech recognition with quaternion neural networks

Conference Paper

Full-text available

Dec 2018

Neural network architectures are at the core of powerful automatic speech recognition systems (ASR). However, while recent researches focus on novel model architectures, the acoustic input features remain almost unchanged. Traditional ASR systems rely on multidimensional acoustic features such as the Mel filter bank energies alongside with the firs...

Learning Speaker Representations with Mutual Information

Preprint

Full-text available

Dec 2018

Learning good representations is of crucial importance in deep learning. Mutual Information (MI) or similar measures of statistical dependence are promising tools for learning these representations in an unsupervised way. Even though the mutual information between two random variables is hard to measure directly in high dimensional spaces, some rec...

Learning Speaker Representations with Mutual Information

Preprint

Full-text available

Dec 2018

Speaker Recognition from Raw Waveform with SincNet

Conference Paper

Dec 2018

Interpretable Convolutional Filters with SincNet

Preprint

Full-text available

Nov 2018

Deep learning is currently playing a crucial role toward higher levels of artificial intelligence. This paradigm allows neural networks to learn complex and abstract representations, that are progressively obtained by combining simpler ones. Nevertheless , the internal "black-box" representations automatically discovered by current neural architect...

Interpretable Convolutional Filters with SincNet

Preprint

Full-text available

Nov 2018

Deep learning is currently playing a crucial role toward higher levels of artificial intelligence. This paradigm allows neural networks to learn complex and abstract representations, that are progressively obtained by combining simpler ones. Nevertheless, the internal "black-box" representations automatically discovered by current neural architectu...

Speech recognition with quaternion neural networks

Preprint

Full-text available

Nov 2018

THE PYTORCH-KALDI SPEECH RECOGNITION TOOLKIT

Preprint

Full-text available

Nov 2018

The availability of open-source software is playing a remarkable role in the popularization of speech recognition and deep learning. Kaldi, for instance, is nowadays an established framework used to develop state-of-the-art speech recognizers. PyTorch is used to build neural networks with the Python language and has recently spawn tremendous intere...

The PyTorch-Kaldi Speech Recognition Toolkit

Preprint

Full-text available

Nov 2018

Twin Regularization for Online Speech Recognition

Poster

Full-text available

Sep 2018

Online speech recognition is crucial for developing natural human-machine interfaces. This modality, however, is significantly more challenging than off-line ASR, since real-time/low-latency constraints inevitably hinder the use of future information, that is known to be very helpful to perform robust predictions. A popular solution to mitigate thi...

Twin Regularization for Online Speech Recognition

Conference Paper

Full-text available

Sep 2018

SPEAKER RECOGNITION FROM RAW WAVEFORM WITH SINCNET

Preprint

Full-text available

Aug 2018

Deep learning is progressively gaining popularity as a viable alternative to i-vectors for speaker recognition. Promising results have been recently obtained with Convolutional Neural Networks (CNNs) when fed by raw speech samples directly. Rather than employing standard hand-crafted features, the latter CNNs learn low-level speech representations...

Speaker Recognition from raw waveform with SincNet

Preprint

Full-text available

Jul 2018

Quaternion Recurrent Neural Networks

Preprint

Jun 2018

Automatic context window composition for distant speech recognition

Article

Full-text available

May 2018

Distant speech recognition is being revolutionized by deep learning, that has contributed to significantly outperform previous HMM-GMM systems. A key aspect behind the rapid rise and success of DNNs is their ability to better manage large time contexts. With this regard, asymmetric context windows that embed more past than future frames have been r...

Twin Regularization for online speech recognition

Article

Full-text available

Apr 2018

Light Gated Recurrent Units for Speech Recognition

Article

Full-text available

Mar 2018

A field that has directly benefited from the recent advances in deep learning is automatic speech recognition (ASR). Despite the great achievements of the past decades, however, a natural and robust human–machine speech interaction still appears to be out of reach, especially in challenging environments characterized by significant noise and reverb...

Deep Learning for Distant Speech Recognition

Article

Full-text available

Dec 2017

Mirco Ravanelli

Deep learning is an emerging technology that is considered one of the most promising directions for reaching higher levels of artificial intelligence. Among the other achievements, building computers that understand speech represents a crucial leap towards intelligent machines. Despite the great efforts of the past decades, however, a natural and r...

Deep Learning for Distant Speech Recognition

Thesis

Full-text available

Dec 2017

Mirco Ravanelli

Realistic multi-microphone data simulation for distant speech recognition

Article

Full-text available

Nov 2017

The availability of realistic simulated corpora is of key importance for the future progress of distant speech recognition technology. The reliability, flexibility and low computational cost of a data simulation process may ultimately allow researchers to train, tune and test different techniques in a variety of acoustic scenarios, avoiding the lab...

Audio Concept Classification with Hierarchical Deep Neural Networks

Article

Full-text available

Oct 2017

Audio-based multimedia retrieval tasks may identify semantic information in audio streams, i.e., audio concepts (such as music, laughter, or a revving engine). Conventional Gaussian-Mixture-Models have had some success in classifying a reduced set of audio concepts. However, multi-class classification can benefit from context window analysis and th...

Contaminated speech training methods for robust DNN-HMM distant speech recognition

Article

Full-text available

Oct 2017

Despite the significant progress made in the last years, state-of-the-art speech recognition technologies provide a satisfactory performance only in the close-talking condition. Robustness of distant speech recognition in adverse acoustic conditions, on the other hand, remains a crucial open issue for future applications of human-machine interactio...

The DIRHA-English corpus and related tasks for distant-speech recognition in domestic environments

Article

Full-text available

Oct 2017

This paper introduces the contents and the possible usage of the DIRHA-ENGLISH multi-microphone corpus, recently realized under the EC DIRHA project. The reference scenario is a domestic environment equipped with a large number of microphones and microphone arrays distributed in space. The corpus is composed of both real and simulated material, and...

Improving speech recognition by revising gated recurrent units

Article

Full-text available

Sep 2017

Speech recognition is largely taking advantage of deep learning, showing that substantial benefits can be obtained by modern Recurrent Neural Networks (RNNs). The most popular RNNs are Long Short-Term Memory (LSTMs), which typically reach state-of-the-art performance in many tasks thanks to their ability to learn long-term dependencies and robustne...

Improving Speech Recognition by Revising Gated Recurrent Units

Conference Paper

Full-text available

Aug 2017

Batch-normalized joint training for DNN-based distant speech recognition

Article

Full-text available

Mar 2017

Improving distant speech recognition is a crucial step towards flexible human-machine interfaces. Current technology, however, still exhibits a lack of robustness, especially when adverse acoustic conditions are met. Despite the significant progress made in the last years on both speech enhancement and speech recognition, one potential limitation o...

A network of deep neural networks for distant speech recognition

Article

Full-text available

Mar 2017

Despite the remarkable progress recently made in distant speech recognition, state-of-the-art technology still suffers from a lack of robustness, especially when adverse acoustic conditions characterized by non-stationary noises and reverberation are met. A prominent limitation of current systems lies in the lack of matching and communication betwe...

A network of deep neural networks for Distant Speech Recognition

Conference Paper

Full-text available

Mar 2017

Batch-normalized joint training for DNN-based distant speech recognition

Conference Paper

Full-text available

Dec 2016

Realistic Multi-Microphone Data Simulation for Distant Speech Recognition

Conference Paper

Full-text available

Sep 2016

The DIRHA-ENGLISH corpus and related tasks for distant-speech recognition in domestic environments

Conference Paper

Full-text available

Dec 2015

Insights into Audio-Based Multimedia Event Classification with Neural Networks

Conference Paper

Full-text available

Nov 2015

Multimedia Event Detection (MED) aims to identify events—also called scenes—in videos, such as a flash mob or a wedding ceremony. Audio content information complements cues such as visual content and text. In this paper, we explore the optimization of neural networks (NNs) for audio-based multimedia event classification , and discuss some insights...

Contaminated speech training methods for robust DNN-HMM distant speech recognition

Conference Paper

Full-text available

Sep 2015

A Multi-Channel Corpus for Distant-Speech Interaction in Presence of Known Interferences

Conference Paper

Full-text available

Apr 2015

This paper describes a new corpus of multi-channel audio data designed to study and develop distant-speech recognition systems able to cope with known interfering sounds propagating in an environment. The corpus consists of both real and simulated signals and of a corresponding detailed annotation. An extensive set of speech recognition experiments...

SmartLamp

Data

Oct 2014

A smart lamp based on distant-talking speech recognition with a MEMS microphone array.

DIRHA smart home

Data

Oct 2014

this video shows a prototype developed under the DIRHA project and currently working in a real apartment in Trento (Italy). It is a real-time, multi-room and multi-microphone distant-talking speech recognition system which allows the users to command and control with their voice most of the home appliances and devices. The system can work both with...

DIRHA-GRID-example audio2

Data

Sep 2014

An example of the multi-microphone multi-room dataset generated under the EU DIRHA project. The target scenario is the domestic environment and the reference language is English

DIRHA simcorpus- short sample

Data

Sep 2014

An example of the multi-microphone multi-room and multi-language database generated under the EU DIRHA project. The target scenario is the domestic environment with 40 sample-synchronized channels distributed over 5 different rooms of a real apartment are available. This dataset can be used for distant-talking speech recognition experiments, multi-...

DIRHA-GRID-example audio1

Data

Sep 2014

An example of the multi-microphone multi-room dataset generated under the EU DIRHA project. The target scenario is the domestic environment and the reference language is English. This dataset can be used for distant-talking speech recognition experiments, multi-microphone signal processing, speaker localization, voice activity detection and many ot...

On the selection of the impulse responses for distant-speech recognition based on contaminated speech training

Conference Paper

Full-text available

Sep 2014

Distant-speech recognition represents a technology of fundamental importance for future development of assistive applications characterized by flexible and unobtrusive interaction in home environments. State-of-the-art speech recognition still exhibits lack of robustness, and an unacceptable performance variability, due to environmental noise, reve...