
Comparison of the Ability of Neural Network Model and Humans to Detect a Cloned Voice


Abstract

The vulnerability of a speaker identity verification system to attacks using voice cloning was examined. The research project consisted of creating a model for verifying a speaker's identity based on voice biometrics and then testing its resistance to potential attacks using voice cloning. The Deep Speaker neural speaker-embedding system was trained, and the Real-Time Voice Cloning system was employed, based on the SV2TTS, Tacotron, WaveRNN, and GE2E neural networks. The results of attacks using voice cloning were analyzed and discussed in the context of a subjective assessment of cloned-voice fidelity. The subjective test results and the speaker authentication attempts showed that the tested biometric identity verification system can resist voice cloning attacks even when humans cannot distinguish cloned samples from the originals.
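The verification step described above can be sketched as a similarity test between fixed-dimensional speaker embeddings. The following is a minimal illustration only, not the paper's implementation: the embedding vectors are toy values, and the 0.75 threshold is an assumed, illustrative operating point (real systems tune it on held-out data).

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(enrolled: np.ndarray, probe: np.ndarray, threshold: float = 0.75) -> bool:
    """Accept the probe utterance if its embedding is close enough
    to the enrolled speaker's embedding."""
    return cosine_similarity(enrolled, probe) >= threshold

# Toy 4-dimensional embeddings (real embedding systems use e.g. 512-d vectors).
enrolled = np.array([0.9, 0.1, 0.3, 0.2])
genuine  = np.array([0.8, 0.2, 0.35, 0.15])  # same speaker, slight variation
cloned   = np.array([0.1, 0.9, 0.2, 0.6])    # poorly matched cloned sample

print(verify(enrolled, genuine))  # True
print(verify(enrolled, cloned))   # False
```

A cloning attack succeeds only if the synthesized speech pushes the probe embedding past the threshold; the paper's finding is that perceptual similarity to humans does not guarantee this.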
Article
Background: Aphasia is a communication disorder that affects the ability to process and produce language, severely impacting patients' lives. Computer-aided exercise rehabilitation has been shown to be highly effective for these patients. Objective: In this study, we propose a speech rehabilitation system with mirrored therapy. The goal is to construct effective rehabilitation software for aphasia patients. Methods: The system collects patients' facial photos for mirrored video generation and speech synthesis. The visual feedback provided by the mirror creates an engaging and motivating experience for patients, and the evaluation platform employs machine learning technologies to assess speech similarity. Results: Sophisticated task-oriented rehabilitation training with mirror therapy is presented in the experiments. The three tasks reach average real-time scores of 83.9% for vowel exercises, 74.3% for word exercises, and 77.8% for sentence training. Conclusions: The user-friendly application allows patients to carry out daily training tasks as instructed by therapists or by the menu prompts. Our work demonstrates a promising intelligent mirror software system for reading-based aphasia rehabilitation.
Article
The objective of this work is speaker recognition under noisy and unconstrained conditions. We make two key contributions. First, we introduce a very large-scale audio-visual dataset collected from open-source media using a fully automated pipeline. Most existing datasets for speaker identification contain samples obtained under quite constrained conditions and usually require manual annotation, hence are limited in size. We propose a pipeline based on computer vision techniques to create the dataset from open-source media. Our pipeline involves obtaining videos from YouTube, performing active speaker verification using a two-stream synchronization Convolutional Neural Network (CNN), and confirming the identity of the speaker using CNN-based facial recognition. We use this pipeline to curate VoxCeleb, which contains over a million 'real-world' utterances from over 6000 speakers. This is several times larger than any publicly available speaker recognition dataset. Second, we develop and compare different CNN architectures with various aggregation methods and training loss functions that can effectively recognise identities from voice under various conditions. The models trained on our dataset surpass the performance of previous works by a significant margin.
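The "aggregation methods" mentioned above turn variable-length sequences of per-frame CNN features into a single utterance-level embedding. A minimal sketch of the simplest such method, temporal average pooling followed by length normalisation (the feature values and the 512-d dimensionality are illustrative, not taken from the paper):

```python
import numpy as np

def average_pool(frame_features: np.ndarray) -> np.ndarray:
    """Aggregate per-frame features of shape (T, D) into one
    utterance-level embedding of shape (D,) by averaging over time."""
    return frame_features.mean(axis=0)

# Stand-in for CNN output: 200 frames of 512-d features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 512))

embedding = average_pool(frames)
unit = embedding / np.linalg.norm(embedding)  # length-normalise for scoring
print(unit.shape)  # (512,)
```

Average pooling makes the embedding independent of utterance length, which is what allows utterances of different durations to be compared with a single similarity score.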
Chapter
In this chapter, we introduce some basics of spoken language processing (covering both speech and natural language), which are fundamental to text-to-speech synthesis. Since speech and language are studied in the discipline of linguistics, we first review some basic linguistic knowledge and discuss a key concept called the speech chain, which is closely related to TTS. We then introduce speech signal processing, which covers digital signal processing, speech processing in the time and frequency domains, cepstrum analysis, linear prediction analysis, and speech parameter estimation. Finally, we review some typical speech processing tasks.
Keywords: Spoken language processing; Linguistics; Speech chain; Speech signal processing
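Of the analysis techniques listed above, cepstrum analysis is easy to sketch: the real cepstrum is the inverse Fourier transform of the log-magnitude spectrum. The frame length, sampling rate, and test tone below are illustrative assumptions, not values from the chapter.

```python
import numpy as np

def real_cepstrum(frame: np.ndarray) -> np.ndarray:
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)  # small floor avoids log(0)
    return np.fft.irfft(log_mag, n=len(frame))

fs = 16000                      # sampling rate in Hz
t = np.arange(1024) / fs        # one 64 ms analysis frame
frame = np.sin(2 * np.pi * 200 * t) * np.hanning(1024)  # windowed 200 Hz tone

ceps = real_cepstrum(frame)
print(ceps.shape)  # (1024,)
```

The cepstrum's low-quefrency coefficients capture the spectral envelope (vocal tract shape), while periodic excitation shows up at higher quefrencies, which is why cepstral features are so widely used in both TTS and speaker recognition.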