Please note that the 'AI Perception' image is a mock image. To trigger a loading error, it needs to be implemented within a specific framework.
This webpage is intended for S&P paper review purposes only. It is not for public use. Please do not disclose or distribute.
Single Medium, Multiple Perspectives
This page demonstrates the 11 types of semantic gaps identified in our paper. Below are attack samples where media players (Human Perception) and AI services (AI Perception) interpret the same file differently.
Virtual Cropping Ignorance
AI services ignore 'virtual crop' metadata (e.g., the 'clap' clean aperture property in HEIC/AVIF) and process the entire image, while humans see only the cropped region.
Human Perception
AI Perception
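The cropping gap can be illustrated with a minimal sketch. It assumes a toy image made of character "pixels" and a hypothetical `(left, top, width, height)` crop tuple standing in for the clean aperture metadata; the real property is encoded differently, and `apply_crop` is an illustrative helper, not part of any decoder.

```python
from typing import List, Tuple

Image = List[List[str]]  # rows of single-character "pixels"


def apply_crop(image: Image, crop: Tuple[int, int, int, int]) -> Image:
    """Return the region a metadata-aware viewer would display.

    crop is a hypothetical (left, top, width, height) tuple; it only
    stands in for the container's crop metadata.
    """
    left, top, width, height = crop
    return [row[left:left + width] for row in image[top:top + height]]


def to_text(image: Image) -> str:
    """Join the pixel rows into a printable string."""
    return "\n".join("".join(row) for row in image)


# Full decoded frame: the payload "BAD" sits outside the cropped region.
full_frame = [list("OK..BAD"), list("OK..BAD")]
crop_box = (0, 0, 2, 2)

human_view = to_text(apply_crop(full_frame, crop_box))  # crop applied
ai_view = to_text(full_frame)                           # crop ignored
```

In this toy case the viewer output contains only "OK", while the pipeline that skips the crop also receives "BAD".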
Mirror Flip Ignorance
AI ignores metadata-based mirroring (e.g., 'imir' in AVIF), leading to misinterpretation of orientation-sensitive data like charts.
Human Perception
AI Perception
Rotation Ignorance
As with mirroring, AI services fail to apply rotation metadata and therefore misidentify rotated content (e.g., CAPTCHAs).
Human Perception
AI Perception
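The mirroring and rotation gaps share one mechanism: the stored pixels are unchanged, and the displayed orientation depends on whether the decoder applies the transform metadata. The sketch below assumes a small integer grid as the image, and the `quarter_turns`, `mirror`, and `honor_metadata` parameters are hypothetical names. Applying rotation before mirroring is also an assumption made for illustration; real files define their own transform order.

```python
from typing import List

Grid = List[List[int]]


def rotate_ccw(grid: Grid, quarter_turns: int) -> Grid:
    """Rotate counter-clockwise by quarter_turns * 90 degrees."""
    for _ in range(quarter_turns % 4):
        # Transpose, then reverse the row order: one 90-degree CCW turn.
        grid = [list(col) for col in zip(*grid)][::-1]
    return grid


def mirror_left_right(grid: Grid) -> Grid:
    """Swap left and right by reversing every row."""
    return [row[::-1] for row in grid]


def render(grid: Grid, quarter_turns: int = 0, mirror: bool = False,
           honor_metadata: bool = True) -> Grid:
    """Return what a decoder outputs, with or without the transform metadata."""
    if not honor_metadata:
        return grid  # stored pixels passed through untouched
    out = rotate_ccw(grid, quarter_turns)
    return mirror_left_right(out) if mirror else out


stored = [[1, 2],
          [3, 4]]

human_rotated = render(stored, quarter_turns=1)
human_mirrored = render(stored, mirror=True)
ai_view = render(stored, quarter_turns=1, mirror=True, honor_metadata=False)
```

Here `ai_view` equals the stored grid, while both metadata-aware renderings differ from it.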
External Resource Ignorance
AI fails to process external resources (e.g., image-based subtitles in MKV or overlays in SVG), perceiving only the underlying content.
Human Perception
Video/SVG shows full-screen subtitle/overlay: "Harmful Content"
AI Perception
AI sees underlying video: "Benign Content"
Improper Audio Downmix
AI services use naive downmixing (e.g., simple average) for multi-channel audio, while humans hear a standard-compliant mix, enabling A2A attacks.
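The difference between the two downmixes can be sketched numerically. The example assumes a 5.1 layout, a weighted stereo fold-down in the style of ITU-R BS.775 with the LFE channel dropped, and a gain of 0.7071; actual players and services may use other coefficients. The channel assignment is one hypothetical layout chosen to make the arithmetic easy to follow, not necessarily the construction used in the paper.

```python
from typing import Dict, List

Channels = Dict[str, List[float]]
G = 0.7071  # assumed -3 dB fold-down gain for centre and surround channels


def naive_mono(ch: Channels) -> List[float]:
    """Average every channel with equal weight, LFE included."""
    return [sum(frame) / len(ch) for frame in zip(*ch.values())]


def weighted_stereo(ch: Channels) -> Dict[str, List[float]]:
    """Fold 5.1 down to stereo with weighted gains; the LFE channel is dropped."""
    left = [fl + G * fc + G * sl for fl, fc, sl in zip(ch["FL"], ch["FC"], ch["SL"])]
    right = [fr + G * fc + G * sr for fr, fc, sr in zip(ch["FR"], ch["FC"], ch["SR"])]
    return {"L": left, "R": right}


heard = [0.5, -0.5, 0.25]     # stands in for the speech a listener hears
hidden = [0.75, 0.75, -0.75]  # stands in for the speech the ASR transcribes
inverted = [-s for s in heard]

layout: Channels = {
    "FL": heard, "FR": heard, "FC": [0.0] * 3,
    "LFE": hidden, "SL": inverted, "SR": inverted,
}

ai_input = naive_mono(layout)           # front and surround cancel; only LFE survives
human_output = weighted_stereo(layout)  # LFE dropped; an attenuated copy of `heard` remains
```

With this layout the equal-weight average reduces to `hidden / 6`, while the weighted fold-down yields `heard` scaled by `1 - G`, so the two consumers receive different signals from the same file.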
Human Perception (Browser)
"Your Honor, I plead guilty."
AI Perception (ASR)
"I refuse to admit guilt."
Improper Alpha Fusion (WebP)
AI improperly handles the alpha channel and therefore perceives content different from what humans see (e.g., enabling moderation bypass).
Human Perception
AI Perception
Improper Transparency Fusion
AI discards alpha or tRNS transparency data, while humans see the image correctly blended against a background (e.g., white).
Human Perception
AI Perception
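Both alpha-related gaps come down to whether transparency is blended against a background or simply discarded. The following sketch assumes straight (non-premultiplied) 8-bit RGBA pixels and a white page background; `composite_over` and `drop_alpha` are illustrative helpers rather than any service's actual code.

```python
from typing import Tuple

RGBA = Tuple[int, int, int, int]
RGB = Tuple[int, int, int]


def composite_over(pixel: RGBA, background: RGB = (255, 255, 255)) -> RGB:
    """Blend a straight-alpha pixel over an opaque background colour."""
    r, g, b, a = pixel
    alpha = a / 255
    blended = [round(c * alpha + bg * (1 - alpha))
               for c, bg in zip((r, g, b), background)]
    return (blended[0], blended[1], blended[2])


def drop_alpha(pixel: RGBA) -> RGB:
    """Discard the alpha channel and keep the raw colour values."""
    return pixel[:3]


# A fully transparent pixel whose hidden colour value is black.
transparent_black = (0, 0, 0, 0)

human_pixel = composite_over(transparent_black)  # blended against white
ai_pixel = drop_alpha(transparent_black)         # alpha ignored
```

A viewer that composites shows white for this pixel, whereas a pipeline that drops alpha sees black, so content drawn in the colour channels of transparent pixels is visible only to the latter.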
Incorrect Content Choice
AI incorrectly selects the first track/frame from a multi-track file (e.g., HEIC), while humans see the primary track/frame.
Human Perception (Primary)
AI Perception (First)
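A minimal sketch of the selection logic follows. The `Container` and `Item` classes are hypothetical stand-ins for a parsed multi-item file, and `primary_item_id` models the container's declaration of which item is primary (in HEIF this role is played by the primary item box); no real parser is used here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    item_id: int
    label: str


@dataclass
class Container:
    items: List[Item]      # in file order
    primary_item_id: int   # which item the file declares as primary


def pick_primary(container: Container) -> Item:
    """Select the item the container marks as primary."""
    return next(i for i in container.items
                if i.item_id == container.primary_item_id)


def pick_first(container: Container) -> Item:
    """Select whatever item happens to come first in the file."""
    return container.items[0]


crafted = Container(
    items=[Item(1, "benign decoy"), Item(2, "content shown to the user")],
    primary_item_id=2,
)

human_item = pick_primary(crafted)
ai_item = pick_first(crafted)
```

The two selectors agree whenever the primary item is stored first, which is why the gap only shows up in deliberately ordered files.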
Deterministic Image Sampling
AI processes only the first frame of an animation (e.g., GIF), while humans see the persistent second frame.
Human Perception (Frame 2)
AI Perception (Frame 1)
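Why the second frame "persists" can be sketched with frame delays. The example assumes a play-once animation (no looping) whose first frame has a very short delay; `frame_on_screen` is a hypothetical helper and the delay values are illustrative.

```python
from typing import List


def frame_on_screen(delays_ms: List[int], elapsed_ms: int) -> int:
    """Index of the frame shown after elapsed_ms in a play-once animation."""
    shown_until = 0
    for index, delay in enumerate(delays_ms):
        shown_until += delay
        if elapsed_ms < shown_until:
            return index
    return len(delays_ms) - 1  # playback has stopped; the last frame stays


delays = [20, 1000]  # frame 1 flashes for 20 ms, then frame 2 takes over

ai_frame = 0                                 # pipeline decodes only the first frame
human_frame = frame_on_screen(delays, 5000)  # what is on screen after 5 s
```

After the first 20 ms a viewer sees only the second frame (index 1), while a pipeline that decodes frame index 0 never sees it.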
Deterministic Video Sampling
AI deterministically samples a few frames (e.g., 1 per sec), while humans see the full video. Attackers can place malicious content in sampled frames.
Human Perception
Full video (mostly malicious)
AI Perception
Sampled frames (all benign)
(Warning: NSFW Content)
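The sampling gap can be sketched as index arithmetic. The example assumes a 30 fps clip, a sampler that takes the first frame of every second, and an attacker who knows that schedule; `sampled_indices` and `build_attack_labels` are hypothetical helpers, and real services may sample differently.

```python
from typing import List


def sampled_indices(total_frames: int, fps: int,
                    samples_per_second: int = 1) -> List[int]:
    """Frame indices picked by a fixed-rate sampler starting at frame 0."""
    step = fps // samples_per_second
    return list(range(0, total_frames, step))


def build_attack_labels(total_frames: int, fps: int) -> List[str]:
    """Label frames so that exactly the sampled positions are benign."""
    sampled = set(sampled_indices(total_frames, fps))
    return ["benign" if i in sampled else "malicious"
            for i in range(total_frames)]


labels = build_attack_labels(total_frames=90, fps=30)  # a 3-second clip

ai_view = [labels[i] for i in sampled_indices(90, 30)]
malicious_share = labels.count("malicious") / len(labels)
```

In this toy clip the sampler inspects 3 of 90 frames, all benign, while 87 of the 90 frames a viewer watches are labelled malicious.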
POC Video Showcase
ChatGPT-tRNS (Chatbot): Download
Gemini-alpha-avif (Chatbot): Download
Gemini-alpha-png (Chatbot): Download
Gemini-alpha-heic (Chatbot): Download
Qwen-mkv-multitrack (Chatbot): Download
Gemini-crop (Chatbot): Download
Grok-mirror (Chatbot): Download
Kimi-rotation (Chatbot): Download
Gemini-multiTrack (Chatbot): Download
Baidu-ocr (tRNS): Download
Azure-ocr (tRNS): Download
Qwen (Audio): Download
Tencent ASR (Audio): Download
Aliyun ASR (Audio): Download
Kimi (Audio): Download
Deepgram (Audio): Download
Gemini (Audio): Download
Aliyun Audio Moderation (Audio): Download (Warning: NSFW Content)
Tencent Audio Moderation (Audio): Download (Warning: NSFW Content)
Aliyun Video Audio Moderation (Audio): Download (Warning: NSFW Content)
Tencent Video Audio Moderation (Audio): Download (Warning: NSFW Content)
Baidu Content Moderation (Track): Download (Warning: NSFW Content)
Aliyun Content Moderation (Track): Download (Warning: NSFW Content)
Tencent Content Moderation (Crop): Download (Warning: NSFW Content)
Tencent Content Moderation (Sampling): Download (Warning: NSFW Content)
Tencent Content Moderation (SVG): Download (Warning: NSFW Content)