Vision-Language Models: Charting the Course from Visual Perception to Robust Reasoning
Latest 100 papers on vision-language models: May 23, 2026
Vision-Language Models (VLMs) are at the forefront of AI innovation, bridging the gap between what machines see and what they understand. From conversational agents to autonomous robots, VLMs promise to revolutionize how we interact with the digital and physical worlds. However, this burgeoning field faces significant challenges, including ensuring robust visual grounding, mitigating hallucinations, and achieving precise spatial and temporal reasoning. Recent research offers exciting breakthroughs, pushing the boundaries of VLM capabilities and interpretability.
The Big Idea(s) & Core Innovations
Many recent papers highlight a critical insight: for VLMs to truly excel, they must move beyond superficial pattern matching to genuinely perceive and reason. Researchers are identifying and addressing the fundamental bottlenecks that prevent this, often by re-evaluating how visual information is represented and processed.
A recurring theme is the decoupling of perception and reasoning. In “From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models” by Juncheng Wu et al., a stark finding emerges: 86.9% of VLM reasoning failures stem from visual perception errors, not reasoning limitations. This work from UC Santa Cruz and Amazon proposes a staged post-training framework that explicitly solidifies visual perception before refining reasoning. Complementing this, “GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning” by Deshui Miao et al. from Harbin Institute of Technology, Shenzhen, argues that geometry should be a representational prerequisite, not a late fusion signal, shaping visual tokens before language reasoning. Their token-adaptive geometric evidence allocation from a multi-level geometry bank significantly boosts spatio-temporal reasoning.
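To make "token-adaptive geometric evidence allocation" more concrete, here is a minimal sketch in plain Python. It is not GeoWeaver's implementation: the dot-product scoring, the softmax gate, the additive fusion, and the names `fuse_token` and `geometry_levels` are all assumptions chosen for illustration. The only idea taken from the description above is that each visual token draws a different mix of evidence from several geometry levels before any language reasoning happens.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fuse_token(token, geometry_levels):
    """Add a per-token weighted mix of geometry features to one visual token.

    token: list of d floats (one visual token).
    geometry_levels: list of K vectors of length d, one per geometry level
        (a hypothetical stand-in for a multi-level geometry bank).
    Returns the fused token and the per-level weights.
    """
    # Assumption: relevance of each level is scored by a dot product.
    weights = softmax([dot(token, g) for g in geometry_levels])
    fused = list(token)
    for w, g in zip(weights, geometry_levels):
        for i, value in enumerate(g):
            fused[i] += w * value
    return fused, weights


token = [1.0, 0.0]
levels = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = fuse_token(token, levels)
print(weights, fused)
```

In this toy case the token aligns with the first level, so that level receives the larger weight. A real system would learn the scoring function rather than use a raw dot product.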
Another core innovation focuses on enhancing interpretability and robustness. “Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models” by Piotr Kubaty et al. from Jagiellonian University introduces CEDAR, an invertible reparameterization method that reveals compositional semantic structure in VLM embeddings without increasing dimensionality, making them more interpretable. Addressing the pervasive issue of hallucinations, “Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens” by Meng Shen et al. from Nanyang Technological University finds that most hallucinations occur around image-invariant tokens influenced by language priors, and proposes loss re-weighting and data filtering to mitigate this. Similarly, “Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy” by Yutong Xie et al. from Southeast University presents ILVAD, a training-free method that uses inter-layer attention discrepancies to identify the correct visual evidence and strengthen attention to it, preventing “visual forgetting.” This directly counters the “illusion of visual re-examination” diagnosed by Chufan Shi et al. from University of Southern California, in which VLMs merely say they are re-examining an image without actually looking at it again, especially during self-reflection.
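The loss re-weighting idea mentioned above can be illustrated with a short sketch. It assumes that some upstream step has already flagged the tokens to emphasize, and it uses a hypothetical weight `alpha`. Which tokens are flagged and how strongly they are weighted are the paper's design choices and are not reproduced here, and the function name `reweighted_nll` is made up for this example.

```python
import math


def reweighted_nll(token_log_probs, flagged, alpha=2.0):
    """Weighted mean negative log-likelihood over a token sequence.

    token_log_probs: log p(target token) at each position.
    flagged: booleans, True for tokens that should receive extra weight.
    alpha: hypothetical weight for flagged tokens (unflagged tokens get 1.0).
    """
    if not token_log_probs:
        raise ValueError("empty sequence")
    if len(token_log_probs) != len(flagged):
        raise ValueError("token_log_probs and flagged must have equal length")
    weights = [alpha if f else 1.0 for f in flagged]
    total = sum(w * -lp for w, lp in zip(weights, token_log_probs))
    return total / sum(weights)


log_probs = [math.log(0.9), math.log(0.5), math.log(0.8)]
flags = [False, True, False]
plain = reweighted_nll(log_probs, flags, alpha=1.0)       # ordinary mean NLL
emphasized = reweighted_nll(log_probs, flags, alpha=3.0)  # flagged token counts 3x
print(plain, emphasized)
```

Because the flagged token is also the least confident one in this toy sequence, raising `alpha` increases the loss, which pushes training harder on exactly that position.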
For efficient and reliable VLM deployment, several papers introduce novel architectural and training strategies. “Visual-Advantage On-Policy Distillation for Vision-Language Models” by Ruiqi Liu et al. from Institute of Automation, CAS, proposes VA-OPD, a distillation method that focuses supervision on the sparse, visual-critical tokens that drive multimodal reasoning, yielding faster and more visually grounded student models. “Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth” by Yuhuan Wu et al. from Hong Kong University of Science and Technology introduces a “perceptual starvation” training paradigm that restricts visual bandwidth to force models to actively use visual operations, a behavior that then transfers to unconstrained settings.
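One way to picture "focusing supervision on visual-critical tokens" is the sketch below. It is an illustration under stated assumptions, not the VA-OPD algorithm: it assumes a token's visual dependence can be approximated by how much the teacher's log-probability rises when the image is present, and it simply keeps the top fraction of tokens. The names `select_visual_critical`, `masked_distill_loss`, and `keep_ratio` are hypothetical.

```python
import math


def select_visual_critical(lp_with_image, lp_without_image, keep_ratio=0.25):
    """Indices of tokens whose teacher log-prob gains most from the image."""
    n = len(lp_with_image)
    if n == 0 or n != len(lp_without_image):
        raise ValueError("inputs must be non-empty and of equal length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Assumed proxy for visual dependence: log-prob gain from seeing the image.
    advantage = [a - b for a, b in zip(lp_with_image, lp_without_image)]
    k = max(1, math.ceil(keep_ratio * n))
    ranked = sorted(range(n), key=lambda i: advantage[i], reverse=True)
    return sorted(ranked[:k])


def masked_distill_loss(per_token_loss, selected):
    """Average a per-token distillation loss over the selected positions only."""
    return sum(per_token_loss[i] for i in selected) / len(selected)


lp_with = [-0.1, -0.2, -0.3, -0.1]
lp_without = [-0.1, -2.0, -0.4, -1.5]
selected = select_visual_critical(lp_with, lp_without, keep_ratio=0.5)
loss = masked_distill_loss([0.5, 1.0, 0.2, 3.0], selected)
print(selected, loss)
```

Tokens 1 and 3 are the ones the image helps most, so only their losses contribute. The per-token loss itself (for example, a divergence between teacher and student distributions) is left as an input because the source does not specify it.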
Under the Hood: Models, Datasets, & Benchmarks
The advancements above are supported by a rich ecosystem of new models, specialized datasets, and rigorous benchmarks:
- Conceptual Embedding Disentanglement via Adaptive Rotation (CEDAR): This method uses CLIP (ViT-L/14) and BLIP/CoCa for sparse disentanglement, validated on ImageNet-1K.
- GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning: Leverages a frozen VGGT geometry encoder and Qwen3.5 VLM backbone, evaluated on spatio-temporal reasoning benchmarks like VSI-Bench, ReVSI, SPAR-Bench, and BLINK. Code: https://github.com/yahooo-m/GeoWeaver.
- Supervised Classification Heads as Semantic Prototypes: Repurposes ImageNet-21K pretrained classification heads with CLIP ViT-B/32 text encoder, evaluated on Flickr30K and multiple classification benchmarks. Code: https://github.com/david-mnd/recycling4vlalignment.
- AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation: Uses a unified reason-act framework on R2R-CE and RxR-CE datasets in the Habitat simulator for embodied AI. Project page: https://gwxuan.github.io/AwareVLN/.
- FundusGround: Clinically Interpretable Ophthalmic VQA: A new benchmark with 10,719 fundus images and 72,706 QA pairs, using ETDRS grid for spatial grounding. Evaluates models like GPT-4o and Qwen3-VL. Dataset and prompts will be open-sourced.
- PhysX-Omni: Unified Simulation-Ready Physical 3D Generation: Introduces PhysXVerse (8K+ assets) and PhysX-Bench, leveraging Qwen2.5-VL-7B-Instruct with a template-based RLE geometry representation. Official page: https://physx-omni.github.io/.
- FLAT-PACK BENCH: Furniture Assembly Tasks: A novel benchmark for spatio-temporal understanding, evaluating GPT-5, Gemini, Qwen, and InternVL3 families on video demonstrations. Project website: flat-pack-bench.github.io.
- JMed48k: Multi-Profession Japanese Medical Licensing Benchmark: A corpus of 48,862 questions and 20,142 images, evaluating 21 VLMs to identify visual evidence use in medical contexts.
- COCOTREE: Open Tree-Structured Visual Decomposition: A large-scale dataset (21K images, 1.8M nodes) generated by combining LVLMs with SAM 3. Code: https://github.com/melonkick3090/COCOTree.
- Visual-Advantage On-Policy Distillation: Distillation framework for Qwen3-VL (4B, 8B, 32B teachers) on Geometry3K and ViRL39K datasets, evaluated on MathVerse, HallusionBench, AI2D, etc.
- MM-Conv: Context-Aware Grounding in 3D Dialogue: A multimodal benchmark with 6.7 hours of VR interaction data. Evaluates Qwen2.5-VL and GroundingDINO. Code to be released with the dataset.
- BEiTScore: Reference-free Image Captioning Evaluation: Uses a lightweight cross-encoder based on BEiT-3 and introduces LongCapVLCP benchmark for long-form captions. Code: https://github.com/microsoft/unilm/blob/master/beit3/README.md.
- Thermo-VL: Extending VLMs to Thermal Infrared Perception: Integrates thermal infrared with RGB via Molmo-7B backbone, with Thermo-VL-Bench for evaluation. Code: thusharakart.github.io/Thermo-VL.
- WikiVQABench: Knowledge-Grounded VQA: A human-curated benchmark from Wikipedia and Wikidata, evaluating 15 VLMs (256M-90B) on knowledge-intensive reasoning. Dataset: https://huggingface.co/datasets/ibm-research/WikiVQABench.
- TempGlitch: Temporal Glitch Detection in Gameplay Videos: A benchmark of 750 temporal glitch videos, evaluating 12 VLMs (GPT-5, Claude, Gemini, Qwen, Gemma) on temporal reasoning failures.
- Reducing Object Hallucination in LVLMs: Validates methods across LLaVA-v1.5, PaliGemma, Bunny-v1.1. Paper: https://arxiv.org/pdf/2605.21300.
- MONET: Massive, Open, Non-redundant and Enriched Text-to-image dataset: 104.9M curated image-text pairs from multi-VLM re-captioning. Dataset: https://huggingface.co/datasets/jasperai/monet/.
- Finding the Correct Visual Evidence Without Forgetting (ILVAD): Training-free method for LLaVA-1.5, Qwen2-VL, Qwen3-VL, InternVL3. Code: https://github.com/ytx-ML/ILVAD.
- SPpruner: Subject-Centric Progressive Visual Token Reduction: Plug-and-play for LLaVA-1.5, Qwen2.5-VL models, tested on 22 benchmarks.
- ArchSIBench: Architectural Spatial Intelligence Benchmark: 3,000 QA pairs by experts, evaluating 27 VLMs against human baselines on floor plans and architectural drawings. Dataset: https://huggingface.co/datasets/ArchSIBench/ArchSIBench.
- RoboJailBench: Adversarial Attacks and Defenses in Embodied Robotic Agents: Benchmark of 6 frontier VLMs on 6,069 instructions across 5 robotics datasets. Code and project page: https://purseclab.github.io/RoboAbstention/.
- Ablate-to-Validate: Are VLMs Really Using Continuous Thought Tokens?: Diagnostic using LLaVA and Qwen2.5-VL with discrete/continuous depth tokens. Paper: https://arxiv.org/abs/2605.21642.
- Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA: Framework using active zooming with Qwen2.5-VL on liver, breast, and thyroid datasets.
- Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction: Framework for geometric reasoning with GeoGebra constraint engine. Project page: https://draw2think.github.io/.
- QwenSafe: Multimodal Content Rating Description Identification: A VLM for app content rating using Qwen3-VL-8B and metadata2CRD pipeline.
- From Seeing to Thinking (VLM-CapCurriculum): Project page: https://ucsc-vlaa.github.io/VLM-CapCurriculum/.
- CaMo: Camera Motion Grounded Evaluation and Training: Introduces Spatial Narrative Score (SNS) and CaMo-30K dataset to train CaMo-3B VLM. Code: https://github.com/hsiangwei0903/CaMo.
- VL-DPO: Vision-Language-Guided Finetuning for Autonomous Driving: Uses VLMs as zero-shot reasoners for Waymo Open End-to-End Driving Dataset.
- A-TPT: Attention-Guided Test-Time Prompt Tuning: Uses CLIP with ViT-B/16, ViT-B/32, ResNet50 on 9 datasets. Code: https://github.com/SEU-VIPGroup/A-TPT.
- SplitQ: Low-Bit Quantization for LVLMs: Post-training quantization for Qwen2.5-VL and LLaVA-v1.5 models. Code: https://github.com/EMVision-NK/SplitQ.
- Structured Layout Priors for Document Understanding: Uses RT-DETR with granite-docling-258M on OOD benchmarks.
- EyeVLM: Benchmarking Gaze Following and Social Gaze Prediction: Unified benchmark for gaze understanding in VLMs. Fine-tuning with LLaMAFactory. Paper: https://arxiv.org/pdf/2605.19859.
- FineBench: Fine-grained Human Activity Understanding: Densely annotated benchmark (199K QA pairs) from 64 videos, evaluates 15 VLMs including GPT-5.
- MSAlign: Aligning Molecule and Mass Spectra Foundation Models: Aligns frozen DreaMS and ChemBERTa models on NPLIB1, MassSpecGym, Spectraverse datasets. Paper: https://arxiv.org/pdf/2605.19752.
- PStar: Pseudocode-Guided Structured Reasoning: Training-free framework for Qwen2.5-VL-7B for robotic automation. Paper: https://arxiv.org/pdf/2605.19663.
- DPL-ReID: Dual-Prompt CLIP for Occluded Person Re-ID: Uses CLIP with Real-World Occlusion Augmentation. Code: https://github.com/stone-qiao/DPL-ReID.
- Investigating Cross-Modal Skill Injection: Evaluates Idefics2, LLaVA, Qwen2-VL with various merging methods. Paper: https://arxiv.org/pdf/2605.19523.
- Brain Alignment in Interactive Gameplay: Uses fMRI to evaluate VLMs and LAMs during Atari gameplay.
- RE-VLM: Event-Augmented VLM for Scene Understanding: Dual-stream RGB-Event VLM with PEOD-Chat and RGBE-Chat datasets. Code: https://github.com/bupt-ai-cz/RE-VLM.
- RoboJailBench: Benchmarking Adversarial Attacks in Embodied AI: Unified benchmark for adversarial attacks and defenses on embodied agents. Code: https://purseclab.github.io/benchmark-for-robotics-security/.
- iGSP: Implicit Gradient Subspace Projection for Continual Learning: Framework for CLIP models on MTIL benchmark. Code: https://github.com/GeoX-Lab/iGSP.
- RotateK: Rotation-Aligned Key Channel Pruning: Pruning framework for LLaVA-NeXT-8B, Qwen2.5-VL-7B-Instruct on various VQA benchmarks.
- EgoBabyVLM: Cross-Modal Learning from Naturalistic Egocentric Video: Challenge suite with BabyView 2025.1 (863 hours of video) and Machine-DevBench. Code: https://github.com/facebookresearch/egobabyvlm.
- MedFM-Robust: Benchmarking Robustness of Medical Foundation Models: Evaluates LLaVA-Med, MedGemma, Gemini-2.5-flash, GPT-4o-mini on 40 perturbation types across 8 modalities. Code: https://github.com/AbnerAI/MedFM-Robust.
- Reasoning Portability: Guiding Continual Learning for MLLMs: RLVR-based CL for MLLMs on VizWiz, IconQA, ScienceQA datasets. Code: https://github.com/lluosi/RDB-CL.
- INAR-VL: Input-Aware Routing for Edge-Cloud VLM Inference: Routing system for Qwen-VL and LLaVA-OV on VQAv2, TextVQA, GQA.
- VT-Bench: Unified Visual-Tabular Multi-Modal Learning Benchmark: First unified benchmark covering discriminative/generative tasks across 14 datasets. Code: https://github.com/Ziyi-Jia990/VT-Bench.
- CATA: Continual Machine Unlearning via Conflict-Averse Task Arithmetic: Framework for CLIP models on ImageNet, CIFAR, Food-101.
- Not What You Asked For: Typographic Attacks in Household Robot Manipulation: Simulation with Habitat and HomeRobot, using CLIP and DETIC.
- VISAFF: Speaker-Centered Visual Affective Feature Learning: Tuning-free framework for Qwen2-VL/Qwen3-VL-Embedding on MELD and IEMOCAP. Code: https://anonymous.4open.science/r/speaker-2365/.
- PERL: Parameter Efficient Reasoning in CLIP Latent Space: Lightweight adaptation for CLIP (ViT-B/16) on 15 benchmarks. Paper: https://arxiv.org/pdf/2605.18464.
- What is Holding Back Latent Visual Reasoning?: Causal intervention experiments on LanteRn, LVR, Monet, ILVR on VisCoT, BLINK, V*Bench.
- GAUC: Geometry-Aware Uncertainty Coresets for Histopathology: Training-free coreset selection for Qwen VLM, LLaVA VLM on CRC-100K, MHIST. Paper: https://arxiv.org/pdf/2605.18419.
- Wasserstein Equilibrium Decoding for Medical VQA: Extends game-theoretic decoding to VLMs on VQA-RAD, PathVQA. Code: https://github.com/luca-hagen/Wasserstein-BDG-medical-VQA.
- SPATIOROUTE: Dynamic Prompt Routing for Zero-Shot Spatial Reasoning: Dynamic prompt generation for Qwen3-2B/4B on SQA3D benchmark. Paper: https://arxiv.org/pdf/2605.18209.
- Self-Evolving Spatial Reasoning (SAGE): Self-evolving post-training for VLMs on 6 video understanding and 7 spatial reasoning benchmarks. Paper: https://arxiv.org/pdf/2605.18162.
- SkyNative: Native Multimodal Remote Sensing Framework: Encoder-free VLM on AID, VRSBench, DOTA-val, XLRS benchmarks. Paper: https://arxiv.org/pdf/2605.17949.
- CounterCount: Diagnostic for Counting Bias: Dataset with factual/counterfactual image pairs, evaluates Qwen3-VL, Gemma3 families. Paper: https://arxiv.org/pdf/2605.17826.
- CosFly-Track: Large-Scale Multi-Modal Dataset for UAV Visual Tracking: 12K trajectories, 7 data channels. Paper: https://arxiv.org/pdf/2605.17776.
- WJoconde: Multimodal Cultural Heritage Knowledge Graph: New KG benchmark for French cultural heritage, uses LLaMA and BLIP. Paper: https://arxiv.org/pdf/2605.17669.
- When a Zero-Shooter Cheats: Improving Age Estimation: Identifies ‘identity shortcut’ in VLMs using CelebA, AgeDB, FG-Net datasets. Paper: https://arxiv.org/pdf/2605.17658.
- SafeLens: Deliberate and Efficient Video Guardrails: Fast-and-slow inference architecture using influence-guided data curation from SafeWatch. Paper: https://arxiv.org/pdf/2605.17610.
- TAME: Test-Time Adversarial Prompt Tuning: Mixture-of-Experts defense for CLIP on 11 datasets. Paper: https://arxiv.org/pdf/2605.17577.
- Employing Vision-Language Models for Face Image Quality Assessment: Benchmarks Qwen, Gemma, Idefics, Phi on CelebA-HQ, LFW, IJB-B, SCFace. Code: github.com/ThEnded32/VLM4FIQA.git.
- FastOCR: Dynamic Visual Fixation for Efficient Document Parsing: KV cache pruning for Qwen2.5-VL, dots.ocr, DeepSeek-OCR, olmOCR, LLaVA-OneVision. Paper: https://arxiv.org/pdf/2605.17447.
- Medical Context Distorts Decisions in Clinical VLMs: Evaluates LLaVA, Qwen2.5-VL, MedGemma, GPT-5, Gemini 3 Pro on MIMIC-CXR. Code: https://github.com/dsrestrepo/context-distortion-vlms.
- Single-Sample Black-Box Membership Inference Attack (CSA-MIA): Attack against LLaVA-1.5, MiniGPT-4, GPT-4, Claude-3 using CLIP encoder. Code: https://anonymous.4open.science/r/CSA-MIA-F2B200110/.
- Attention Hijacking: Response Manipulation Across Queries: Adversarial attack on LLaVA-1.5, InternVL-2.5, Qwen2.5-VL, DeepSeek-VL. Paper: https://arxiv.org/pdf/2605.17310.
- Event-Grounded Sparse Autoencoders for VLA Policies: Interpretability for OpenVLA and π0.5 PaliGemma. Paper: https://arxiv.org/pdf/2605.17204.
- PluRule: Benchmark for Moderating Pluralistic Communities: Multimodal, multilingual benchmark on Reddit, evaluates GPT-5.2 and Qwen3-VL. Dataset: https://hf.co/datasets/osome-iu/PluRule.
- UCSF-PDGM-VQA: Brain Tumor MRI Interpretation: VQA dataset for 3D brain MRI, benchmarks LLaVA-Med, MedGemma, GPT5-mini. Code: https://anonymous.4open.science/r/VLM-Brain-Tumor-QA-pipeline-65BD/.
- HEED: Density-Weighted Residual Alignment for Hybrid VLM Distillation: Distillation for Qwen3-VL, InternVL-3.5, MiniCPM-V, GLM with hybrid Mamba/attention architectures. Paper: https://arxiv.org/pdf/2605.17093.
- EPIC-Bench: Embodied Visual Grounding: Benchmark with 6,661 annotated samples, evaluates 89 VLMs. Paper: https://arxiv.org/pdf/2605.17070.
- Structured Labeling for Autonomous Driving (NuScenes-S, FastDrive): Dataset with machine-friendly key-value pairs, and compact FastDrive VLM (0.9B params). Paper: https://arxiv.org/pdf/2506.05442.
- CrossMPI: Cross-Modal Prompt Injection Attack: Image-only perturbation attack on MiniGPT4, InstructBLIP, BLIP-2, BLIVA, Qwen2.5-VL. Paper: https://arxiv.org/pdf/2605.16090.
- MIND: Decoupling Model-Induced Label Noise: Framework using OpenSeg for zero-shot experiments on 3D semantic segmentation. Paper: https://arxiv.org/pdf/2605.16081.
- VideoSeeker: Incentivizing Instance-level Video Understanding: Agentic paradigm with visual prompts for GPT-4o, Gemini-2.5-Pro. Project page: https://gaotiexinqu.github.io/VideoSeeker/.
- Segmentation, Detection and Explanation: CT Appearance Reasoning: Unified autoregressive framework using LVLMs with BTCV++ dataset. Paper: https://arxiv.org/pdf/2605.15997.
- PAGER: Bridging the Semantic-Execution Gap in Geometric GUI Control: Framework for pixel-precise GUI tasks with PAGE Bench. Code: https://github.com/opengoa/PAGER.
- Sparse Autoencoders for CLIP Fine-tuning (SAE-FT): Fine-tuning for CLIP models on ImageNet and distribution shift benchmarks. Code: https://github.com/Fabian-Mor/sae-ft.
- Group Revision for Object-Level Grounding: Group-revision paradigm for LVLM fine-tuning on RefCOCOg, VisionReasoner7K, ReasonSeg, CountBench. Code: https://github.com/yyliu01/GroupRevision.
- DepthVLM: Unlocking Dense Metric Depth Estimation: Framework transforming VLMs into native depth predictors, with DepthVLM-Bench. Project page: https://depthvlm.github.io/.
- BiomedAP: Dual-Anchor Framework for Medical VLM Adaptation: Framework for BiomedCLIP-PubMedBERT on 11 medical benchmarks. Code: https://github.com/tongdiedie/BiomedAP.
- EntropyScan: Backdoor Detection in LVLMs: Lightweight trigger-agnostic detection method using Tsallis entropy analysis. Paper: https://arxiv.org/pdf/2605.15711.
- VCG-Bench: Unified Benchmark for Visual-Centric Structured Generation and Editing: Benchmark for diagram generation/editing using mxGraph XML. Paper: https://arxiv.org/pdf/2605.15677.
Impact & The Road Ahead
These advancements have profound implications across diverse fields. In robotics and embodied AI, models are becoming more capable of self-aware navigation (“AwareVLN”), precise manipulation (“GaussianDream”, https://arxiv.org/pdf/2605.20752), and even discerning when not to act (“The Yes-Man Syndrome”, https://arxiv.org/pdf/2605.20544). However, the vulnerability of these systems to typographic attacks (“Not What You Asked For”, https://arxiv.org/pdf/2605.18593) and the challenge of integrating structured reasoning for autonomous driving (“Bridging Structure and Language”, https://arxiv.org/pdf/2605.20942) highlight the critical need for robust, trustworthy AI.
Medical AI is seeing significant progress in VQA for complex tasks like brain tumor MRI interpretation (“UCSF-PDGM-VQA”, https://arxiv.org/pdf/2605.17140) and chest X-ray analysis (“HalluCXR”, https://arxiv.org/pdf/2605.20469). The emphasis on clinical interpretability, mitigating hallucinations, and ensuring factuality (“Regulating Anatomy-Aware Rewards”, https://arxiv.org/pdf/2605.20277) is crucial for safe deployment. Yet models' persistent over-reliance on textual context at the expense of the image itself (“Medical Context Distorts Decisions”, https://arxiv.org/pdf/2605.17436) remains a major concern.
Beyond specific applications, the drive for interpretability, efficiency, and robustness is a unifying theme. Techniques like sparse disentanglement (“Conceptualizing Embeddings”), attention steering (“When a Zero-Shooter Cheats”, https://arxiv.org/pdf/2605.17658), and innovative data distillation (“MONET”, https://arxiv.org/pdf/2605.21272; “SafeLens”, https://arxiv.org/pdf/2605.17610) are making VLMs more transparent, practical, and resilient against attacks. The shift towards dynamic and adaptive prompting (“SPATIOROUTE”, https://arxiv.org/pdf/2605.18209) and self-evolving frameworks (“RISE”, https://arxiv.org/pdf/2605.20914; “SAGE”, https://arxiv.org/pdf/2605.18162) indicates a future where VLMs are not just larger, but fundamentally smarter and more adaptable.
The journey from “seeing” to “thinking” for Vision-Language Models is far from over. These papers collectively highlight that achieving genuine visual grounding, nuanced reasoning, and robust performance demands a holistic approach, encompassing improved architectural designs, meticulously curated data, and novel training paradigms. The coming years promise even more exciting breakthroughs as researchers continue to unravel the complexities of multimodal intelligence.