Cisco Reveals Security Gaps in Vision Language Models | eSecurity Planet

Cisco researchers found that tiny, nearly invisible image perturbations can bypass vision language model safety mechanisms.

Written By
Ken Underhill
May 8, 2026

Vision language models (VLMs) continue to expand the capabilities of artificial intelligence by combining image and text understanding into a single system. 

However, recent research from Cisco into typographic prompt injection attacks highlights significant weaknesses in how these models interpret and secure visual information. 

The second installment of Reading Between the Pixels explores how small image perturbations can manipulate VLM behavior, revealing two distinct safety failure modes: readability recovery and refusal reduction.

“Because many modern AI models can ‘read’ images, we found that we could make tiny, imperceptible tweaks to an image that bypass model guardrails and alignment, flipping a model response from a refusal to compliance,” said Amy Chang, Head of AI Threat Intelligence and Security Research at Cisco, in an email to eSecurity Planet.

She added, “This research is a reminder that AI models can also be tricked through pictures, not just words. It’s critical that people understand that AI security measures should extend beyond text-only protections and consider how we can also secure other modalities.”

Key Takeaways From the Research

  • Cisco researchers found that small image perturbations can bypass vision language model (VLM) safety mechanisms without visibly altering the image to humans.
  • The study identified two major VLM failure modes: readability recovery and refusal reduction.
  • Attack success rates rose sharply after optimization. For example, Claude Sonnet 4.5 went from 0% to 28% under heavy blur conditions.
  • Researchers demonstrated that degraded images can evade OCR detection while still remaining machine-readable to AI models.
  • The findings suggest organizations need security defenses that protect representation space, not just pixel-level image analysis.

How Optimized Image Perturbations Affected VLM Security 

Security Finding | What Researchers Observed | Why It Matters
Embedding distance impacts attack success | Images semantically closer to text increased ASR | VLMs remain vulnerable to typographic prompt injection
Small perturbations restored readability | Blurred or tiny text became interpretable to models | OCR filters may fail to detect harmful content
Refusal reduction occurred | Models shifted from refusal to compliance | Safety alignment can break under subtle image changes
Attack transferability was possible | Perturbations generalized across multiple models | Proprietary models may be vulnerable without direct access
Human visibility remained low | Images still looked distorted to humans | Attackers can evade both users and automated defenses

How Embedding Distance Impacts Attack Success 

The first phase of the research established a strong correlation between text-image embedding distance and attack success rate (ASR). 

Embedding distance refers to how closely a model associates an image with its intended textual meaning in representation space. 

The researchers found that when typographic images drifted farther from their source text because of blur, rotation, or reduced font size, attack success rates declined.

Conversely, images positioned closer in embedding space produced more successful attacks.
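
To make the idea of embedding distance concrete, here is a minimal sketch that compares a text embedding with two image embeddings. It assumes cosine distance as the metric, which the article does not specify, and the three vectors are made-up placeholders rather than outputs of any real embedding model.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical embeddings; real ones would come from a multimodal embedding model.
text_emb = [0.9, 0.1, 0.3]           # the source instruction as text
clean_image_emb = [0.8, 0.2, 0.3]    # the same text rendered legibly
blurred_image_emb = [0.2, 0.9, 0.1]  # the same text after heavy degradation

d_clean = cosine_distance(text_emb, clean_image_emb)
d_blurred = cosine_distance(text_emb, blurred_image_emb)
print(f"clean: {d_clean:.3f}, blurred: {d_blurred:.3f}")
```

In this toy setup the degraded image sits farther from the text than the clean one, which is the pattern the researchers associate with lower attack success.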

Researchers Tested Targeted Optimization Techniques 

Building on this finding, the second phase investigated whether targeted optimization could intentionally reduce embedding distance and revive failed attacks. 

Researchers applied small, bounded perturbations to degraded images to make them appear semantically closer to their original text within the model’s internal representation system. 

The optimization process relied on multiple multimodal embedding models, including Qwen3-VL-Embedding, JinaCLIP v2, OpenAI CLIP ViT-L/14-336, and SigLIP SO400M, without requiring access to the target VLM itself.

The methodology adapted a Spectrum Simulation Attack with Common Weakness Attack (SSA-CWA) framework. 

Over 100 optimization steps, perturbations were constrained to a maximum of 12.5% pixel alteration, enabling subtle image modifications that remained visually difficult for humans or OCR systems to interpret. 
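
The general shape of such an optimization can be sketched as a projected sign-gradient loop. This is not the SSA-CWA method the researchers adapted: it swaps in a single toy linear encoder for the ensemble of embedding models, estimates gradients by finite differences, and assumes that the 12.5% limit is a per-pixel bound on a 0-to-1 intensity scale. Every name and value below is hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def embed(weights, pixels):
    """Toy stand-in for an image encoder: a fixed linear map."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in weights]

def optimize_perturbation(pixels, target, weights, eps=0.125, step=0.01, steps=100, h=1e-4):
    """Sign-gradient ascent on cosine similarity, kept within eps of the original."""
    adv = list(pixels)
    for _ in range(steps):
        base = cosine_similarity(embed(weights, adv), target)
        signs = []
        for i in range(len(adv)):
            probe = list(adv)
            probe[i] += h  # finite-difference probe on one pixel
            delta = cosine_similarity(embed(weights, probe), target) - base
            signs.append((delta > 0) - (delta < 0))
        for i, s in enumerate(signs):
            moved = adv[i] + step * s
            # Project into the eps-ball around the original, then into [0, 1].
            moved = max(pixels[i] - eps, min(pixels[i] + eps, moved))
            adv[i] = max(0.0, min(1.0, moved))
    return adv

# Hypothetical 4-pixel "image", 2-dimensional text embedding, and encoder weights.
weights = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]]
pixels = [0.2, 0.8, 0.3, 0.6]
target = [1.0, 0.2]

adv = optimize_perturbation(pixels, target, weights)
before = cosine_similarity(embed(weights, pixels), target)
after = cosine_similarity(embed(weights, adv), target)
print(f"similarity before: {before:.3f}, after: {after:.3f}")
```

The projection step is what keeps the change small: no pixel moves more than `eps` from its original value, yet the image's embedding still shifts toward the target text. An ensemble version would average the similarity over several encoders before taking the gradient sign.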

Researchers evaluated the attacks against GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B using heavily degraded typographic images, including 6-pixel fonts, 8-pixel fonts, 90-degree rotations, heavy blur, and combinations of blur, noise, and low contrast.

Attack Success Rates Increased Across Multiple Models 

The results demonstrated that optimized perturbations significantly increased attack success rates in low-baseline scenarios. 

Claude Sonnet 4.5 improved from 0% to 28% ASR under heavy blur conditions, while GPT-4o increased from 0% to 16% under rotated text conditions. 

These findings suggest that carefully engineered perturbations can bypass both OCR-based detection systems and portions of VLM safety alignment mechanisms.
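
For readers unfamiliar with the metric, attack success rate is simply the share of attempts that a judge marks as successful. The sketch below uses invented outcome lists for illustration; the article does not report sample sizes, and these counts are not taken from the research.

```python
def attack_success_rate(outcomes):
    """Percentage of attempts judged successful (True)."""
    if not outcomes:
        raise ValueError("no attempts recorded")
    return 100.0 * sum(outcomes) / len(outcomes)

# Hypothetical outcomes for 20 degraded images, before and after optimization.
baseline = [False] * 20
optimized = [True] * 5 + [False] * 15

asr_before = attack_success_rate(baseline)
asr_after = attack_success_rate(optimized)
print(f"ASR before: {asr_before:.0f}%, after: {asr_after:.0f}%")
```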

Two Key Failure Modes Emerged 

Researchers identified two major failure modes that explain how these attacks succeed. 

Readability Recovery Weakens Model Defenses 

The first, readability recovery, occurs when perturbations restore a model’s ability to interpret degraded text. 

For example, GPT-4o initially failed to process many 6-pixel font samples because the text was unreadable. 

After optimization, readability improved substantially, although GPT-4o’s refusal mechanisms still blocked most harmful requests. 

In contrast, Claude Sonnet 4.5 not only regained readability under heavy blur conditions but also complied with many harmful prompts after optimization, demonstrating weaker downstream safety enforcement once text became interpretable.

Refusal Reduction Creates Greater Security Risks 

The second and more concerning failure mode is refusal reduction.

In these scenarios, the VLM can already partially read the text but initially refuses to comply with harmful instructions. 

Small perturbations then alter the model’s internal reasoning process, shifting outputs from refusal to compliance without improving human-visible legibility. 

This behavior was particularly noticeable in rotated text and 8-pixel font conditions, where optimized perturbations reduced refusal rates and increased successful attacks despite minimal perceptual differences to human observers.
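
One way to separate the two failure modes when reviewing test results is to compare each sample's outcome before and after optimization. The sketch below assumes three hypothetical outcome labels ("unreadable", "refused", "complied") and invented sample data; the researchers' own labeling scheme is not described in the article.

```python
from collections import Counter

def classify_shift(before, after):
    """Map a before/after outcome pair to one of the two failure modes."""
    if before == "unreadable" and after in ("refused", "complied"):
        return "readability_recovery"  # the model can now read the text
    if before == "refused" and after == "complied":
        return "refusal_reduction"     # the model could read it, now obeys it
    return "no_shift"

# Hypothetical (before, after) outcomes for five samples.
samples = [
    ("unreadable", "refused"),
    ("unreadable", "complied"),
    ("refused", "complied"),
    ("refused", "refused"),
    ("refused", "complied"),
]

counts = Counter(classify_shift(b, a) for b, a in samples)
print(dict(counts))
```

Under this scheme, a sample that goes from unreadable straight to compliance counts as readability recovery, matching the Claude Sonnet 4.5 heavy blur case described above.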

Implications for Organizations 

From a cybersecurity perspective, these findings reveal two exploitable artifacts. 

First, attackers can generate images that appear illegible to humans and OCR-based filters while remaining machine-readable to VLMs. 

Second, perturbations learned from successful attacks can transfer across models and configurations, allowing attackers to suppress safety refusals without needing access to proprietary model internals. 

Together, these artifacts create a practical attack chain that enables both detection evasion and compliance manipulation.

Why Representation-Space Security Matters 

The implications for practitioners are substantial. Current safety mechanisms often focus on pixel-level detection or OCR filtering, assuming that unreadable images are inherently safe. 

However, this research demonstrates that representation-space vulnerabilities can allow malicious semantic content to survive even when visual readability is lost. 

Defensive strategies must therefore extend beyond surface-level image analysis and incorporate safeguards that are robust within embedding and reasoning spaces.
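
As one illustration of what a representation-space check might look like, the sketch below flags an image when its embedding sits close to the embedding of known harmful text, even if OCR returns nothing. This is not a defense proposed in the research. The function name, threshold, blocklist, and vectors are all hypothetical, and a filter like this could itself be targeted by perturbations tuned against its embedding model.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def should_block(image_emb, ocr_text, blocked_embs, blocked_terms, threshold=0.8):
    """Block on an OCR keyword match or on embedding proximity to blocked text."""
    ocr_hit = any(term in ocr_text.lower() for term in blocked_terms)
    embedding_hit = any(
        cosine_similarity(image_emb, blocked) >= threshold for blocked in blocked_embs
    )
    return ocr_hit or embedding_hit

# Hypothetical blocklist and embeddings.
blocked_terms = ["ignore previous instructions"]
blocked_embs = [[0.9, 0.1, 0.3]]

# OCR finds no text, but the embedding is close to blocked content.
evasive = should_block([0.8, 0.2, 0.3], "", blocked_embs, blocked_terms)
# OCR finds no text and the embedding is far from blocked content.
benign = should_block([0.2, 0.9, 0.1], "", blocked_embs, blocked_terms)
print(evasive, benign)
```

The first case shows why the embedding check matters: an OCR-only filter would have passed that image because it extracted no text at all.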

Ultimately, Reading Between the Pixels underscores the growing complexity of multimodal AI security. 

While embedding distance offers a valuable framework for understanding typographic prompt injection, the ability to weaponize small perturbations against safety alignment systems reveals fundamental weaknesses in current VLM architectures. 

As multimodal AI adoption accelerates, organizations deploying these systems must prioritize defenses capable of resisting adversarial manipulation at both the visual and representational levels.
