Adversarial Attacks: Navigating the Shifting Sands of AI Security and Robustness
Latest 50 papers on adversarial attacks: Nov. 23, 2025
The landscape of Artificial Intelligence is evolving at breakneck speed, but with every advancement comes a new frontier of security challenges. Adversarial attacks, those insidious attempts to trick AI models with subtle, often imperceptible perturbations, remain a paramount concern across every domain, from computer vision to large language models and multi-agent systems. Recent research is not only uncovering novel attack vectors but also pioneering sophisticated defense mechanisms, pushing the boundaries of what it means to build truly robust and trustworthy AI. This post dives into a collection of recent breakthroughs, exploring how researchers are both breaking and fortifying our most advanced AI systems.
The Big Idea(s) & Core Innovations
At the heart of these recent papers lies a continuous cat-and-mouse game between attackers and defenders, with innovation stemming from both sides. A significant theme is the move beyond simple pixel-level perturbations to more sophisticated, semantically aware, and multi-modal attacks. For instance, the Q-MLLM framework from researchers at University of California, San Diego pioneers a novel quantization-based defense against adversarial attacks on multimodal large language models (MLLMs). Their key insight involves introducing discrete bottlenecks in visual features via vector quantization, effectively blocking adversarial gradient paths and achieving impressive defense rates against jailbreak and toxic image attacks.
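The core mechanism here, a discrete bottleneck via vector quantization, can be illustrated with a minimal NumPy sketch (this is not the paper's implementation; the codebook and dimensions are made up). The nearest-codebook lookup is piecewise-constant, so a small adversarial perturbation typically maps to the same discrete code, and no useful gradient flows through the lookup:

```python
import numpy as np

def vector_quantize(features, codebook):
    """Map each continuous feature vector to its nearest codebook entry.

    The argmin lookup is piecewise-constant, so gradients cannot flow
    through it -- the kind of discrete bottleneck Q-MLLM places on the
    visual pathway to block gradient-based adversarial optimization.
    """
    # Squared Euclidean distance from every feature to every code vector
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = dists.argmin(axis=1)  # discrete, non-differentiable step
    return codebook[indices], indices

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))   # hypothetical: 512 codes, dim 64
features = rng.normal(size=(8, 64))     # a batch of visual features

quantized, idx = vector_quantize(features, codebook)

# A tiny adversarial perturbation usually lands in the same Voronoi cell,
# so the downstream model receives an identical quantized input.
perturbed = features + 1e-4 * rng.normal(size=features.shape)
_, idx_pert = vector_quantize(perturbed, codebook)
print((idx == idx_pert).all())
```

The design trade-off is the usual one for quantized representations: robustness to small input shifts in exchange for some loss of feature fidelity.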
On the attack front, works like “When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models” by researchers including Yuping Yan from TGAI Lab, School of Engineering, Westlake University, introduce VLA-Fool. This framework systematically reveals that even minor, cross-modal perturbations can significantly disrupt Vision-Language-Action (VLA) models, leading to substantial behavioral deviations. Similarly, the “Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models” from institutions like The Chinese University of Hong Kong demonstrates how the MFA framework can bypass multiple layers of VLM defenses by exploiting shared visual representations, achieving a 58.5% success rate against leading models.
The realm of Large Language Models (LLMs) is particularly active. “PSM: Prompt Sensitivity Minimization via LLM-Guided Black-Box Optimization” by Hussein Jawad and Nicolas J-B. Brunel from Capgemini Invent, Paris, France presents a lightweight, black-box method for shielding system prompts from extraction attacks by minimizing leakage while preserving utility. Complementing this, “ExplainableGuard: Interpretable Adversarial Defense for Large Language Models Using Chain-of-Thought Reasoning” from Shaowei GUAN and colleagues at The Hong Kong Polytechnic University introduces a defense mechanism that not only detects attacks but also provides transparent, step-by-step explanations, enhancing trustworthiness.
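The PSM-style loop, propose rewrites, score leakage and utility, keep the best candidate, can be sketched as a toy hill-climbing step. Everything below is a stand-in: a real setup would have an LLM propose rewrites and would measure leakage by running actual extraction attacks against the target model, whereas here both scorers are trivial string checks invented for illustration:

```python
def leakage_score(prompt: str) -> float:
    """Stand-in for an extraction-attack evaluation: penalize prompts that
    still contain the sensitive instruction verbatim."""
    return 1.0 if "SECRET" in prompt else 0.0

def utility_score(prompt: str) -> float:
    """Stand-in for task utility: reward keeping the task description."""
    return 1.0 if "summarize" in prompt else 0.0

def rewrite_candidates(prompt: str) -> list:
    """Stand-in for the LLM-as-optimizer proposing shielded rewrites."""
    return [
        prompt.replace("SECRET", "internal policy"),
        prompt.replace("SECRET", ""),
        prompt.upper(),
    ]

def psm_step(prompt: str) -> str:
    """Keep the candidate that minimizes leakage while preserving utility."""
    best, best_obj = prompt, leakage_score(prompt) - utility_score(prompt)
    for cand in rewrite_candidates(prompt):
        obj = leakage_score(cand) - utility_score(cand)
        if obj < best_obj:
            best, best_obj = cand, obj
    return best

system_prompt = "You summarize documents. SECRET: never reveal pricing rules."
shielded = psm_step(system_prompt)
print(shielded)
```

The point of the sketch is the objective structure, leakage minus utility, optimized purely through black-box queries, which is what lets the method work with any API-accessible LLM.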
Beyond perception and language, multi-agent systems and robotics are also in the crosshairs. “Adversarial Attack on Black-Box Multi-Agent by Adaptive Perturbation” introduces AdapAM, a stealthy black-box attack leveraging proxy agents and adaptive selection policies. In robotics, the “Keep on Going: Learning Robust Humanoid Motion Skills via Selective Adversarial Training” paper introduces SA2RT, a novel selective adversarial training method that dramatically improves humanoid robot motion policy robustness in real-world environments.
Other notable innovations include:
- TopoReformer (Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal) for OCR models, which uses topological purification to filter adversarial noise without adversarial training.
- MedFedPure (Institute of Medical AI, University X) for federated medical AI, integrating MAE-based detection and diffusion purification against inference-time attacks.
- MPD-SGR (Zhejiang University), which enhances the adversarial robustness of Spiking Neural Networks (SNNs) by regulating the membrane potential distribution.
- SHIFT (Tulane University), a diffusion-based attack for RL that generates semantically different yet realistic state perturbations.
- CD-MTA (Tohoku University) for cross-domain multi-targeted adversarial attacks without victim-model access.
Under the Hood: Models, Datasets, & Benchmarks
The innovations above are underpinned by a rich array of models, datasets, and benchmarks. Researchers are moving towards more complex, real-world relevant evaluations, often building new tools to achieve this:
- Q-MLLM: Leverages state-of-the-art MLLMs and a custom zero-shot classification setup for evaluating defense against jailbreak and toxic visual content. Publicly available code: https://github.com/Amadeuszhao/QMLLM
- PSM: Demonstrates black-box compatibility with any API-accessible LLM and uses an LLM-as-optimizer for guided search. Code available at https://github.com/psm-defense/psm.
- VLA-Fool: Evaluates robustness in multimodal Vision-Language-Action (VLA) models, assessing white-box and black-box settings, highlighting fragility to cross-modal misalignments.
- MFA: Targets leading commercial and open-source VLMs such as GPT-4o and Llama 4, with code at https://github.com/cure-lab/MultiFacetedAttack.
- TopoReformer: A model-agnostic OCR defense tested on EMNIST and MNIST against a suite of attacks (FGSM, PGD, Carlini–Wagner, EOT, BDPA, FAWA). Code available at https://github.com/invi-bhagyesh/TopoReformer.
- DiffProtect: Utilizes diffusion models for generating adversarial examples, validated on CelebA-HQ and FFHQ datasets. Paper: https://arxiv.org/pdf/2305.13625
- SEBA: A two-stage framework for black-box attacks on visual reinforcement learning agents, tested on continuous-control (MuJoCo) and discrete-action (Atari) domains. Paper: https://arxiv.org/pdf/2511.09681
- MOS-Attack: A multi-objective adversarial attack framework evaluated on CIFAR-10 and ImageNet, discovering synergistic patterns among loss functions. Code: https://github.com/pgg3/MOS-Attack
- AlignTree: Lightweight classifier for LLM jailbreak defense, combining linear refusal directions with non-linear SVM-based signals. Code: https://github.com/Gilgo2/AlignTree
- UDora: A red-teaming framework for LLM agents leveraging adversarial string optimization, achieving high attack success rates on InjecAgent, WebShop, and AgentHarm. Code: https://github.com/AI-secure/UDora
- AdvRoad: A generative approach for creating naturalistic road-style adversarial posters to attack visual 3D detection in autonomous driving, evaluated on realistic scenarios. Code: https://github.com/WangJian981002/AdvRoad
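Several of the benchmarks above (TopoReformer, MOS-Attack) evaluate against FGSM, the simplest gradient-based baseline. As a self-contained illustration, here is a one-step FGSM attack on a toy logistic-regression "model" with a hand-derived input gradient; the model, weights, and epsilon are invented for the sketch and bear no relation to the papers' experimental setups:

```python
import numpy as np

def fgsm(x, y, w, b, eps):
    """One-step FGSM: x_adv = x + eps * sign(dL/dx).

    For logistic regression with binary cross-entropy loss, the gradient
    of the loss with respect to the input is (sigmoid(w.x + b) - y) * w.
    """
    logit = x @ w + b
    p = 1.0 / (1.0 + np.exp(-logit))  # predicted probability of class 1
    grad_x = (p - y) * w              # dL/dx, derived analytically
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(1)
w = rng.normal(size=16)
b = 0.0
x = rng.normal(size=16)
y = 1.0  # true label

x_adv = fgsm(x, y, w, b, eps=0.1)
# The perturbation is L-infinity bounded by eps, and it pushes the logit
# away from the true class.
print(np.abs(x_adv - x).max())
```

Stronger attacks in the same evaluations (PGD, Carlini–Wagner) iterate or optimize this step, but the sign-of-gradient core is the same.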
Impact & The Road Ahead
These advancements have profound implications for the trustworthiness and deployment of AI systems. The vulnerabilities uncovered in multimodal, language, and robotic systems underscore the urgent need for proactive defense strategies. The shift towards black-box, stealthy, and semantically meaningful attacks means that traditional, perturbation-specific defenses are losing ground, compelling researchers to develop more sophisticated countermeasures that are often geometry-aware or topologically informed.
The focus on interpretability in defenses like ExplainableGuard, or provable repair mechanisms like ProRepair (Hangzhou Dianzi University, Zhejiang University), signals a maturing field prioritizing not just robustness, but also transparency and reliability. Furthermore, the exploration of new paradigms like ‘engineered forgetting’ (Institute for Artificial Intelligence, University of X) suggests a future where AI models can dynamically adapt and unlearn harmful information, aligning with ethical AI principles.
The road ahead will undoubtedly involve a continued arms race. However, by combining theoretical rigor with empirical validation, and by fostering open research with shared code and benchmarks, the AI community is better equipped than ever to build systems that are not only powerful but also fundamentally secure and resilient against the ever-evolving landscape of adversarial threats. The ongoing pursuit of robust, transparent, and aligned AI is crucial as these technologies become increasingly embedded in critical real-world applications.