R5/R6: Improper Audio Downmix
This demo shows how AI services and browsers can render the same audio file differently because they apply different downmix algorithms. An attacker can craft an audio file that a browser plays as one piece of content (e.g., "I refuse to admit guilt") while an AI service (such as an ASR system) transcribes something completely different (e.g., "Your Honor, I plead guilty").
Note: Different players may use different downmix matrices. This vulnerability is known to reproduce in the Chrome browser and the Gemini environment. It may fail in other environments.
Note: Attack success depends on the computer model, OS version, browser version, and playback device (headphones vs. speakers). We therefore list the environments where our tests succeeded. For browsers, we tested on Chrome v142.0.7444.60 (MacBook Pro 14-inch M3 Max, macOS 26.0.1, built-in speakers), Firefox v144.0.2 (Lenovo TP0096C, Windows 10 Enterprise, built-in speakers), and Edge v142.0.3595.69 (same configuration as Firefox).
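To illustrate why two downmix matrices can yield two different signals from one file, here is a minimal sketch. It assumes a two-channel file and two hypothetical sets of downmix weights (an averaging player and a service that keeps only the first channel); these weights are illustrative and are not the actual matrices used by Chrome or Gemini. The function names `craft_channels` and `downmix` are ours, not part of the demo.

```python
def craft_channels(human, machine, w_player, w_service):
    """Solve, per sample, for two channels (c1, c2) such that
    w_player . (c1, c2) == human and w_service . (c1, c2) == machine.
    Uses Cramer's rule on the 2x2 system formed by the two weight pairs."""
    a, b = w_player
    c, d = w_service
    det = a * d - b * c
    if det == 0:
        raise ValueError("downmix weights must be linearly independent")
    ch1 = [(h * d - b * m) / det for h, m in zip(human, machine)]
    ch2 = [(a * m - c * h) / det for h, m in zip(human, machine)]
    return ch1, ch2


def downmix(ch1, ch2, weights):
    """Collapse two channels to mono with the given pair of weights."""
    w1, w2 = weights
    return [w1 * x + w2 * y for x, y in zip(ch1, ch2)]


if __name__ == "__main__":
    # Hypothetical weights: the player averages both channels,
    # the service keeps only the first channel.
    W_PLAYER = (0.5, 0.5)
    W_SERVICE = (1.0, 0.0)

    human = [0.25, -0.5, 0.75]     # samples the listener should hear
    machine = [0.5, 0.25, -0.25]   # samples the service should receive

    ch1, ch2 = craft_channels(human, machine, W_PLAYER, W_SERVICE)
    print("player mix: ", downmix(ch1, ch2, W_PLAYER))
    print("service mix:", downmix(ch1, ch2, W_SERVICE))
```

With these assumed weights the crafted channels are `c1 = machine` and `c2 = 2 * human - machine`, so the second channel can exceed the normal sample range; real audio would need scaling to avoid clipping. The sketch only works when the two weight pairs differ, which is why the attack depends on the player and the service disagreeing on the downmix.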
Generate Attack Audio
Please enter two texts: the one humans should hear (via Chrome) and the one the AI should perceive (via the backend).
Note: Clicking "Generate" calls the hosted attack-audio API. The service creates multi-channel audio from your inputs and returns URLs to three WAV files: audio1_url, audio2_url, and the downmixed poc_audio_url. **The API requires a password to prevent external abuse.**
Note: Our demo uses the MiniMax Speech-02-turbo model for text-to-speech generation, so the supported languages are limited to those the model supports. The API is deployed on an anonymous AWS Lambda function.
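For readers scripting against the API rather than using the "Generate" button, the following sketch shows one way to validate a response. It assumes the API returns a JSON object whose keys are the three names listed above (audio1_url, audio2_url, poc_audio_url); the JSON encoding and the helper name `parse_generate_response` are assumptions, and the request format and endpoint are deliberately not shown because they are not documented here.

```python
import json

# Field names taken from the note above; the JSON encoding is an assumption.
EXPECTED_KEYS = ("audio1_url", "audio2_url", "poc_audio_url")


def parse_generate_response(body: str) -> dict:
    """Parse a response body and return only the three expected WAV URLs.

    Raises ValueError if the body is not a JSON object or a key is missing.
    """
    data = json.loads(body)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in EXPECTED_KEYS if key not in data]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return {key: data[key] for key in EXPECTED_KEYS}
```

Checking for all three fields up front makes a failed or rejected request (for example, a wrong password) fail loudly instead of surfacing later as a missing download.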