Conference Paper

Wiretapping via Mimicry: Short Voice Imitation Man-in-the-Middle Attacks on Crypto Phones

Authors:
  • Visa Research

Abstract and Figures

Establishing secure voice, video and text over Internet (VoIP) communications is a crucial task necessary to prevent eavesdropping and man-in-the-middle attacks. The traditional means of secure session establishment (e.g., those relying upon PKI or KDC) require a dedicated infrastructure and may impose unwanted trust onto third-parties. "Crypto Phones" (popular instances such as PGPfone and Zfone), in contrast, provide a purely peer-to-peer user-centric secure mechanism claiming to completely address the problem of wiretapping. The secure association mechanism in Crypto Phones is based on cryptographic protocols employing Short Authenticated Strings (SAS) validated by end users over the voice medium. The security of Crypto Phones crucially relies on the assumption that the voice channel, over which SAS is validated by the users, provides the properties of integrity and source authentication. In this paper, we challenge this assumption, and report on automated SAS voice imitation man-in-the-middle attacks that can compromise the security of Crypto Phones in both two-party and multi-party settings, even if users pay due diligence. The first attack, called the short voice reordering attack, builds arbitrary SAS strings in a victim's voice by reordering previously eavesdropped SAS strings spoken by the victim. The second attack, called the short voice morphing attack, builds arbitrary SAS strings in a victim's voice from a few previously eavesdropped sentences (less than 3 minutes) spoken by the victim. We design and implement our attacks using off-the-shelf speech recognition/synthesis tools, and comprehensively evaluate them with respect to both manual detection (via a user study with 30 participants) and automated detection. The results demonstrate the effectiveness of our attacks against three prominent forms of SAS encodings: numbers, PGP word lists and Madlib sentences. 
These attacks can be used by a wiretapper to compromise the confidentiality and privacy of Crypto Phones voice, video and text communications (plus authenticity in case of text conversations).
... The choice between AES-128 and AES-256 often hinges on the required security level versus the computational overhead. Although AES-256 offers a higher security margin, its computational cost is also greater, impacting system performance [6]. ...
Thesis
Image Encryption using, Memristive Neural Network, PRNG, Substitution-Box, and Fibonacci Q-Matrix
... In [33], a "voice reordering" attack was introduced against end-to-end short spoken text authenticated systems. In the reordering attack, the attacker collects isolated units of the fingerprint (e.g., words or digits) and combines them to create new fingerprints not spoken before. ...
Preprint
End-to-End Encryption (E2EE) aims to make all messages impossible to read by anyone except you and your intended recipient(s). Many well-known and widely used Instant-Messaging (IM) applications (such as Signal, WhatsApp, and Apple's iMessage) claim to provide E2EE. However, a recent technique called client-side scanning (CSS) makes these E2EE claims grandiose and hollow promises. The CSS is a technology that scans all sending and receiving messages from one end to the other. Some in industry and government now advocate this CSS technology to combat the growth of malicious child pornography, terrorism, and other illicit communication. Even though combating the spread of illegal and morally objectionable content is a laudable effort, it may open further backdoors that impact the user's privacy and security. Therefore, it is not E2EE when there are censorship mechanisms and backdoors in end-to-end encrypted applications. In this paper, we introduce an encrypted keyboard that functions as a system keyboard, enabling users to employ it across all applications on their phones when entering data. By utilizing this encrypted keyboard, users can locally encrypt and decrypt messages, effectively bypassing the CSS system. We first design and implement our encrypted keyboard as a custom keyboard application, and then we evaluate the effectiveness and security of our encrypted keyboard. Our study results show that our encrypted keyboard can successfully encrypt and decrypt all sending and receiving messages through IM applications, and therefore, it can successfully defeat the CSS technology in end-to-end encrypted systems. We also show that our encrypted keyboard can be used to add another layer of E2EE functionality on top of the existing E2EE functionality implemented by many end-to-end encrypted applications.
... [63], [80] for survey). There are other voice attacks in the speaker recognition domain, such as hidden voice attacks [78] and spoofing attacks [79], [88], [89], [90], [91], [92]. Though these attacks have different attack goals and scenarios from adversarial attacks [15], our preliminary evaluation shows that it is possible to mitigate hidden voice attack [78] and speech synthesis attack [79] via input transformations. ...
Preprint
Speaker recognition systems (SRSs) have recently been shown to be vulnerable to adversarial attacks, raising significant security concerns. In this work, we systematically investigate transformation and adversarial training based defenses for securing SRSs. Based on the characteristics of SRSs, we present 22 diverse transformations and thoroughly evaluate them using 7 recent promising adversarial attacks (4 white-box and 3 black-box) on speaker recognition. With careful regard for best practices in defense evaluations, we analyze the strength of transformations to withstand adaptive attacks. We also evaluate and understand their effectiveness against adaptive attacks when combined with adversarial training. Our study provides many useful insights and findings, many of which are new or inconsistent with the conclusions in the image and speech recognition domains, e.g., variable and constant bit rate speech compressions have different performance, and some non-differentiable transformations remain effective against current promising evasion techniques which often work well in the image domain. We demonstrate that the proposed novel feature-level transformation combined with adversarial training is rather effective compared to the sole adversarial training in a complete white-box setting, e.g., increasing the accuracy by 13.62% and attack cost by two orders of magnitude, while other transformations do not necessarily improve the overall defense capability. This work sheds further light on the research directions in this field. We also release our evaluation platform SPEAKERGUARD to foster further research.
... In replay attacks, the adversary pre-records and plays back the voice sample of the passphrase of a legal user to deceive the authentication system [23]. An adversary can also mimic the voice characteristics and style of a legal user to conduct impersonation attacks [24]. Spoofing attacks may greatly harm the users as the adversary may gain access to the victim's smartphone to steal private information and perform malicious operations. ...
Article
Voice authentication is increasingly used for sensitive operations in mobile devices. However, voice biometrics focuses on distinguishing individuals by their spectral features, which cannot deal with spoofing attacks. In this paper, we design and implement a novel software-only anti-spoofing system on smartphones. Our system leverages the pop noise, which is generated by the user's oral airflow when speaking the passphrase opposite the microphone. The pop noise is delicate and subject to user diversity, making it hard to be recorded by replay attacks beyond a certain distance and to be imitated precisely by impersonators. In particular, we design a new pop noise detection scheme to pinpoint pop noises at the phonemic level, based on which we establish a theoretical model to calculate the sound pressure level from the speech signal in order to get the estimated pressure signal, and then analyze the consistency with the actual pressure signal extracted from the pop noise. Our evaluation on a dataset of 30 participants and three smartphones shows that our system achieves over 94.79% accuracy. Our system requires no additional hardware and is robust to various factors including authentication angle, authentication distance, the length of passphrase, ambient noise, etc.
Article
Continuous identity authentication is critical for privacy protection throughout an entire user login session. In this paper, we propose a continuous user authentication mechanism that employs the vibration responses from hand biometrics and is passively activated by natural user-device interaction. Hand vibration responses are embedded in the mechanical vibration of a force-bearing body consisting of one mobile device and one user hand. A built-in accelerometer of the device can capture hand-dependent vibration signals. Considering the concealment of vibration generation and the non-replicability of hand structure, it is difficult for attackers to counterfeit user identity. Moreover, to keep authentication performance robust to interference from tapping behavior, we construct a data augmentation module that jointly leverages a signal processing and learning-based pipeline. It can generate enough vibration responses representing hand structure biometrics under various behaviors to comprehensively capture vibration response variation. We prototype the mechanism on smartphones, and extensive experiments demonstrate that it can achieve satisfactory authentication accuracy.
Chapter
End-to-End Encryption (E2EE) aims to make all messages impossible to read by anyone except you and your intended recipient(s). Many well-known and widely used Instant-Messaging (IM) applications (such as Signal, WhatsApp, Apple’s iMessage, and Telegram) claim to provide an E2EE functionality. However, a recent technique called client-side scanning (CSS), which could be implemented by these IM applications, makes these E2EE claims grandiose and hollow promises. The CSS is a technology that scans all sending and receiving messages from one end to the other, including text, images, audio, and video files. Some in industry and government now advocate this CSS technology to combat the growth of malicious child pornography, terrorism, and other illicit communication. Even though combating the spread of illegal and morally objectionable content is a laudable effort, it may open further backdoors that impact the user’s privacy and security. Therefore, it is not end-to-end encryption when there are censorship mechanisms and backdoors in end-to-end encrypted applications. In this paper, we shed light on this hugely problematic issue by introducing an encrypted keyboard that works as a system keyboard and can be enabled on the user’s phone device as a default system keyboard. Therefore, it works on every application on the user’s phone device when the user is asked to enter some data. To avoid the CSS system, users can use this encrypted keyboard to encrypt and decrypt their messages locally on their phone devices when sending and receiving them via IM applications. We first design and implement our encrypted keyboard as a custom keyboard application, and then we evaluate the effectiveness and security of our encrypted keyboard. Our study results show that our encrypted keyboard can successfully encrypt and decrypt all sending and receiving messages through IM applications, and therefore, it can successfully defeat the CSS technology in end-to-end encrypted systems. 
We also show that our encrypted keyboard can be used to add another layer of E2EE functionality on top of the existing E2EE functionality implemented by many end-to-end encrypted applications.
Keywords: End-to-end encryption, Encrypted keyboard, IM security, Client-side scanning
Article
Traditional automatic face morphing techniques tend to generate blurry intermediate frames when the two input faces differ significantly. We propose a new face morphing approach that deals explicitly with large pose and expression variations. We recover the 3D face geometry of the input images using a projection on a prelearned 3D face subspace. The geometry is interpolated by factoring the expression and pose and varying them smoothly across the sequence. Finally we pose the morphing problem as an iterative optimization with an objective that combines similarity of each frame to the geometry-induced warped sources, with a similarity between neighboring frames for temporal coherence. Experimental results show that our method can generate higher quality face morphing results for more extreme pose, expression and appearance changes than previous methods.
Book
From common consumer products such as cell phones and MP3 players to more sophisticated projects such as human-machine interfaces and responsive robots, speech technologies are now everywhere. Many think that it is just a matter of time before more applications of the science of speech become inescapable in our daily life. This handbook is meant to play a fundamental role for sustainable progress in speech research and development. Springer Handbook of Speech Processing targets three categories of readers: graduate students, professors and active researchers in academia and research labs, and engineers in industry who need to understand or implement some specific algorithms for their speech-related products. The handbook could also be used as a sourcebook for one or more graduate courses on signal processing for speech and different aspects of speech processing and applications. A quickly accessible source of application-oriented, authoritative and comprehensive information about these technologies, it combines the established knowledge derived from research in such fast evolving disciplines as Signal Processing and Communications, Acoustics, Computer Science and Linguistics.
Chapter
Introduction and Overview; Legal Issues; Advertising, Marketing, Persuasion, and Other Related Applications; Dubbing and Voiceovers; Announcing, Newscasting, and Sportscasting; Singing
Chapter
Introduction; Interactions Between What is Said and How it is Said; Brain Function Underlying Emotions and Emotional Nuances in Speech; The Nature and Function of Emotions; Experimental Approaches to the Study of Vocal Emotion; How Does Emotion Affect the Voice?; How do Listeners Perceive Emotion from Voices?; Biological, Social, and Cross-Cultural Perspectives on Vocal Emotion; Stress and Lie Detection from Voice; Summary and Conclusions; Personality and Voice; Voice in Psychiatric Disease; Detection of Intoxication from Voice
Conference Paper
Voice conversion - the methodology of automatically converting one's utterances to sound as if spoken by another speaker - presents a threat for applications relying on speaker verification. We study the vulnerability of text-independent speaker verification systems against voice conversion attacks using telephone speech. We implemented a voice conversion system with two types of features and nonparallel frame alignment methods, and five speaker verification systems ranging from simple Gaussian mixture models (GMMs) to a state-of-the-art joint factor analysis (JFA) recognizer. Experiments on a subset of the NIST 2006 SRE corpus indicate that the JFA method is the most resilient against conversion attacks. Even so, it experiences a more than 5-fold increase in the false acceptance rate, from 3.24% to 17.33%.
Article
This Recommendation describes methods and procedures for conducting subjective evaluations of transmission quality. The main revision encompassed by this version of this Recommendation is the addition of an annex describing the Comparison Category Rating (CCR) procedure. Other modifications have been made to align this Recommendation with the recent revision of Recommendation P.830.