Conference Paper

Wiretapping via Mimicry: Short Voice Imitation Man-in-the-Middle Attacks on Crypto Phones

November 2014

DOI:10.1145/2660267.2660274

Conference: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security

Authors:

Maliheh Shirvanian

Visa Research

Establishing secure voice, video and text communications over the Internet (VoIP) is crucial to preventing eavesdropping and man-in-the-middle attacks. The traditional means of secure session establishment (e.g., those relying upon PKI or KDC) require a dedicated infrastructure and may place unwanted trust in third parties. "Crypto Phones" (popular instances include PGPfone and Zfone), in contrast, provide a purely peer-to-peer, user-centric security mechanism that claims to completely address the problem of wiretapping. The secure association mechanism in Crypto Phones is based on cryptographic protocols employing Short Authenticated Strings (SAS) that end users validate over the voice medium. The security of Crypto Phones therefore relies crucially on the assumption that the voice channel, over which the users validate the SAS, provides integrity and source authentication. In this paper, we challenge this assumption and report on automated SAS voice imitation man-in-the-middle attacks that can compromise the security of Crypto Phones in both two-party and multi-party settings, even if users exercise due diligence. The first attack, called the short voice reordering attack, builds arbitrary SAS strings in a victim's voice by reordering previously eavesdropped SAS strings spoken by the victim. The second attack, called the short voice morphing attack, builds arbitrary SAS strings in a victim's voice from a few previously eavesdropped sentences (less than 3 minutes of speech) spoken by the victim. We design and implement our attacks using off-the-shelf speech recognition/synthesis tools, and comprehensively evaluate them with respect to both manual detection (via a user study with 30 participants) and automated detection. The results demonstrate the effectiveness of our attacks against three prominent forms of SAS encoding: numbers, PGP word lists and Madlib sentences. These attacks can be used by a wiretapper to compromise the confidentiality and privacy of Crypto Phones voice, video and text communications (plus authenticity in the case of text conversations).
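The idea behind the short voice reordering attack can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the eavesdropped SAS recordings have already been segmented into per-unit clips (one clip per digit or word), which the paper attributes to off-the-shelf speech recognition tools. The names `build_unit_bank`, `missing_units` and `forge_sas`, and the use of plain lists of samples as audio clips, are hypothetical choices made for illustration only.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Clip = List[float]  # hypothetical stand-in for a segment of audio samples


def build_unit_bank(
    eavesdropped: Iterable[Tuple[Sequence[str], Sequence[Clip]]]
) -> Dict[str, Clip]:
    """Collect one clip per SAS unit (digit or word) from recorded SAS strings.

    Each item pairs the units of one spoken SAS with its already segmented
    clips, aligned by position. Segmentation itself is assumed, not shown.
    """
    bank: Dict[str, Clip] = {}
    for units, clips in eavesdropped:
        for unit, clip in zip(units, clips):
            bank.setdefault(unit, clip)  # keep the first recording of each unit
    return bank


def missing_units(bank: Dict[str, Clip], target: Sequence[str]) -> List[str]:
    """Return the units of the target SAS that were never recorded."""
    return [unit for unit in target if unit not in bank]


def forge_sas(bank: Dict[str, Clip], target: Sequence[str]) -> Clip:
    """Concatenate recorded clips in the order required by the target SAS."""
    missing = missing_units(bank, target)
    if missing:
        raise KeyError(f"no recording for units: {missing}")
    forged: Clip = []
    for unit in target:
        forged.extend(bank[unit])
    return forged


if __name__ == "__main__":
    # Two previously eavesdropped numeric SAS strings with toy one-sample clips.
    recordings = [
        (["1", "2", "3"], [[0.1], [0.2], [0.3]]),
        (["4", "1"], [[0.4], [0.11]]),
    ]
    bank = build_unit_bank(recordings)
    print(forge_sas(bank, ["3", "1", "4"]))
    print(missing_units(bank, ["9", "1"]))
```

The sketch also shows the attack's stated limitation: a target SAS can only be forged once every one of its units has been heard from the victim, which is the gap the morphing attack addresses by synthesizing units from a short sample of other speech.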

Figures and tables:

Objective evaluation results for the morphing attack

Demographic info for the user study (N = 30)

High-level diagram of the attack

Mean (Std. Dev.) ratings for original and attacked SAS

Cfone protocol flow (SIP: Session Initiation Protocol; RTP: Real-Time Transport Protocol)
