Multimodal Large Language Models: Navigating 3D Space, Enhancing Reasoning, and Ensuring Safety
The latest 80 papers on multimodal large language models, as of Mar. 21, 2026.
Multimodal Large Language Models (MLLMs) are rapidly transforming the AI landscape, bridging the gap between human-like perception and complex reasoning across diverse data types. From understanding the nuances of human emotion to navigating autonomous vehicles, MLLMs promise a future where AI interacts with the world in richer, more intuitive ways. However, this burgeoning field faces significant challenges, particularly in areas like robust 3D spatial understanding, mitigating hallucinations, ensuring safety in critical applications, and achieving efficient, aligned learning. Recent research offers exciting breakthroughs, pushing the boundaries of what these powerful models can achieve.
The Big Idea(s) & Core Innovations:
A central theme emerging from recent papers is the pursuit of more grounded, reliable, and efficient multimodal reasoning. Many works are tackling the challenge of 3D spatial understanding, which is crucial for embodied AI and real-world applications. For instance, H-EmbodVis and OpenAI’s work on Generation Models Know Space: VEGA-3D reveals that modern video generation models implicitly encode 3D geometry and physical dynamics, suggesting a path to leveraging these ‘generative priors’ for improved scene understanding without explicit 3D supervision. Building on this, Microsoft Research, MIT CSAIL, and Columbia University’s Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models enhances 2D Vision-Language Models for advanced 3D reasoning by integrating geometric consistency and situational awareness from monocular video. Similarly, researchers from Tsinghua University and Microsoft Research in Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding introduce Motion-MLLM, integrating egomotion data from IMUs to allow MLLMs to reason about absolute scale and spatial relationships efficiently, without explicit 3D representations.
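To make the egomotion idea concrete, the sketch below shows one way IMU-derived motion features could be projected into an MLLM's token space and prepended to the visual tokens. This is a minimal illustration, not the Motion-MLLM implementation: the module names, dimensions, and the simple prepend-fusion strategy are all assumptions made for the example.

```python
# Minimal sketch of egomotion-aware token fusion, assuming a frozen vision
# encoder and an LLM embedding dimension d_model. Module names and the
# "prepend motion tokens" strategy are illustrative, not the paper's design.
import torch
import torch.nn as nn

class EgomotionFusion(nn.Module):
    def __init__(self, imu_dim: int = 6, d_model: int = 4096, n_motion_tokens: int = 4):
        super().__init__()
        # Encode a window of raw IMU readings (accelerometer + gyroscope)
        # into a fixed number of "motion tokens" in the LLM's embedding space.
        self.imu_encoder = nn.GRU(imu_dim, 256, batch_first=True)
        self.to_tokens = nn.Linear(256, d_model * n_motion_tokens)
        self.n_motion_tokens = n_motion_tokens
        self.d_model = d_model

    def forward(self, imu_seq: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # imu_seq: (B, T, imu_dim); visual_tokens: (B, N, d_model)
        _, h = self.imu_encoder(imu_seq)            # h: (1, B, 256)
        motion = self.to_tokens(h[-1])              # (B, d_model * n_motion_tokens)
        motion = motion.view(-1, self.n_motion_tokens, self.d_model)
        # Prepend motion tokens so the LLM can condition spatial reasoning
        # (e.g., absolute scale) on the camera's own movement.
        return torch.cat([motion, visual_tokens], dim=1)

if __name__ == "__main__":
    fusion = EgomotionFusion()
    imu = torch.randn(2, 120, 6)        # 2 clips, 120 IMU samples each
    vis = torch.randn(2, 256, 4096)     # 2 clips, 256 visual tokens each
    print(fusion(imu, vis).shape)       # torch.Size([2, 260, 4096])
```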
Beyond 3D perception, a critical line of innovation is refining MLLM reasoning processes. Tsinghua University and Huawei Technologies’ Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models identifies ‘attention dispersion’ as a root cause of poor performance in visual reasoning tasks and proposes VRGA to guide models to focus on relevant visual regions. Reinforcing this push for smarter reasoning, The Chinese University of Hong Kong and Shanghai Artificial Intelligence Laboratory introduce SophiaVL-R1 in SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward, a novel approach that uses ‘thinking reward’ signals during reinforcement learning to improve MLLMs’ reasoning quality and generalization. Meanwhile, Accio Team, Alibaba Group, and Zhejiang University’s MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning highlights the struggle of even strong models with deep compositional reasoning, paving the way for more robust evaluations.
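For readers who want to see the shape of the ‘thinking reward’ idea, here is a minimal sketch of how a reasoning-quality score could be blended with an answer-correctness reward before group-normalized policy optimization. The weighting, the judge interface, and the function names are illustrative assumptions rather than SophiaVL-R1's exact recipe.

```python
# Minimal sketch of blending a "thinking reward" with an outcome reward during
# RL fine-tuning, in the spirit of SophiaVL-R1. The weighting scheme, the
# reward-model interface, and the names below are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Rollout:
    thinking: str   # the model's chain-of-thought for one sampled response
    answer: str     # the final answer extracted from that response

def outcome_reward(answer: str, reference: str) -> float:
    # Rule-based correctness check (exact match here for simplicity).
    return 1.0 if answer.strip() == reference.strip() else 0.0

def thinking_reward(thinking: str, reward_model) -> float:
    # A learned judge scores the reasoning process itself in [0, 1], so a
    # correct answer reached by flawed reasoning is rewarded less.
    return float(reward_model.score(thinking))

def combined_reward(rollout: Rollout, reference: str, reward_model,
                    w_think: float = 0.3) -> float:
    r_out = outcome_reward(rollout.answer, reference)
    r_think = thinking_reward(rollout.thinking, reward_model)
    # Weighted sum; advantages for policy optimization (e.g., GRPO) would then
    # be computed by normalizing these rewards within a sampled group.
    return (1.0 - w_think) * r_out + w_think * r_think

if __name__ == "__main__":
    class DummyJudge:
        def score(self, text: str) -> float:
            # Stand-in judge: step-marked, multi-line reasoning scores higher.
            return min(1.0, text.count("\n") / 5.0)

    r = combined_reward(
        Rollout(thinking="Step 1: ...\nStep 2: ...\nStep 3: ...", answer="42"),
        reference="42", reward_model=DummyJudge())
    print(round(r, 3))  # 0.82 = 0.7 * 1.0 + 0.3 * 0.4
```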
Another significant thrust is improving MLLM reliability and safety. In the medical domain, Tsinghua University and Baidu Inc.’s Concept-to-Pixel: Prompt-Free Universal Medical Image Segmentation introduces C2P, a prompt-free framework for universal medical image segmentation that disentangles anatomical reasoning, achieving zero-shot generalization. Addressing real-world medical vulnerabilities, a paper on CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models by Xiang Chen et al. introduces a framework that simulates realistic clinical pipeline shifts to test the robustness of medical vision-language models, proposing post-hoc repair strategies. For broader safety, Fudan University and Ant Group’s OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences shifts the safety paradigm from intent detection to causal projection, introducing the CASPO framework to enhance reasoning about hidden consequences.
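To illustrate the chain-of-distribution-shift idea behind CoDA, the sketch below composes a few pipeline-style perturbations (intensity shift, acquisition noise, resampling loss) that could be applied to a medical image before re-running a vision-language model. The specific corruptions, their order, and their severities are assumptions chosen for illustration, not the paper's protocol.

```python
# Minimal sketch of a "chain of distribution shifts" for robustness testing,
# loosely inspired by CoDA's multi-stage clinical pipeline perturbations.
# The perturbations, ordering, and severities are illustrative assumptions.
import numpy as np

def intensity_shift(img: np.ndarray, delta: float = 0.1) -> np.ndarray:
    # Simulates scanner/site calibration differences.
    return np.clip(img + delta, 0.0, 1.0)

def add_noise(img: np.ndarray, sigma: float = 0.05, seed: int = 0) -> np.ndarray:
    # Simulates acquisition noise.
    rng = np.random.default_rng(seed)
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def downsample(img: np.ndarray, factor: int = 2) -> np.ndarray:
    # Simulates resolution loss from resampling during a PACS export step.
    low = img[::factor, ::factor]
    return np.repeat(np.repeat(low, factor, axis=0), factor, axis=1)

def apply_chain(img: np.ndarray, stages) -> np.ndarray:
    # Compose perturbations in pipeline order; robustness is then measured by
    # comparing model predictions on the clean vs. shifted image.
    for stage in stages:
        img = stage(img)
    return img

if __name__ == "__main__":
    clean = np.random.default_rng(1).random((224, 224))
    shifted = apply_chain(clean, [intensity_shift, add_noise, downsample])
    print(shifted.shape, float(np.abs(shifted - clean).mean()))
```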
Under the Hood: Models, Datasets, & Benchmarks:
The advancements above are underpinned by innovative models, extensive datasets, and rigorous benchmarks. Here’s a snapshot of key resources:
- VEGA-3D Framework: Repurposes video generation models as Latent World Simulators for MLLMs, with code available at https://github.com/H-EmbodVis/VEGA-3D.
- GeoAux-Bench & A2PO: Introduced by Fudan University and Peking University, Thinking with Constructions: A Benchmark and Policy Optimization for Visual-Text Interleaved Geometric Reasoning presents GeoAux-Bench, the first benchmark aligning textual construction steps with ground-truth visual updates for geometric reasoning, alongside A2PO, a reinforcement learning framework for strategic visual construction.
- CoDA Framework: Simulates multi-stage clinical pipeline shifts for robustness evaluation of CLIP-style MVLMs on brain MRI, chest X-ray, and abdominal CT. No public code yet.
- VLM-AutoDrive Framework: From NVIDIA, VLM-AutoDrive: Post-Training Vision-Language Models for Safety-Critical Autonomous Driving Events adapts general-purpose VLMs for autonomous driving event detection, leveraging diverse supervision signals. No public code.
- DEAF Benchmark: Introduced by University of Oxford and XJTLU, DEAF: A Benchmark for Diagnostic Evaluation of Acoustic Faithfulness in Audio Language Models evaluates acoustic faithfulness in Audio MLLMs, available at https://github.com/elevenlabs/elevenlabs-python.
- SkeletonLLM & DrAction: From Peking University and Tencent, Universal Skeleton Understanding via Differentiable Rendering and MLLMs translates skeleton data into MLLMs’ visual modality via differentiable rendering for universal understanding. Code available via Semantic Scholar at https://api.semanticscholar.org/CorpusID:273662196.
- Loc3R-VLM Framework: Equips 2D VLMs with advanced 3D understanding from monocular video, with code at https://kevinqu7.github.io/loc3r-vlm.
- C2P Framework: A prompt-free universal medical image segmentation model with code at https://github.com/Yundi218/Concept-to-Pixel.
- FINER Benchmarks & FINER-Tuning: Technical University of Munich and Google’s FINER: MLLMs Hallucinate under Fine-grained Negative Queries introduces benchmarks for hallucination and a DPO-based fine-tuning approach, with resources at https://explainableml.github.io/finer-project/.
- Hybrid-Frame Strategy: Explored by SJTU and HKUST(GZ) in Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models to mitigate image-video trade-off under video fine-tuning, with code at https://github.com/LLM-Research-HKUST/Hybrid-Frame.
- EvoGuard Framework: An agentic RL-based framework for evolving AI-generated image detection, as presented in EvoGuard: An Extensible Agentic RL-based Framework for Practical and Evolving AI-Generated Image Detection. No public code yet.
- FineViT & FineCap-450M: From Huawei Inc., FineViT: Progressively Unlocking Fine-Grained Perception with Dense Recaptions introduces a high-resolution vision encoder and the largest fine-grained annotated dataset, with code at https://github.com/PeisenZhao/FineViT.
- From Drop-off to Recovery: This Tongyi Lab, Alibaba Group paper (From Drop-off to Recovery: A Mechanistic Analysis of Segmentation in MLLMs) analyzes segmentation in MLLMs, revealing representation drop-off and self-refinement. No public code.
- Generalist Multimodal LLMs Gain Biometric Expertise via Human Salience: University of Notre Dame demonstrates the application of MLLMs for iris PAD using structured prompts, with code at https://github.com/CVRL/Multimodal-LLMs-Biometric-Expertise.
- PaAgent: A reinforcement learning-based framework for portrait-aware image restoration, code available at https://github.com/PAgent-Team/PaAgent.
- OpenQlaw: An agentic AI assistant for 2D quantum materials analysis by University of Arkansas, code at https://github.com/openclaw/openclaw.
- AgriChat & AgriMM: Khalifa University’s AgriChat: A Multimodal Large Language Model for Agriculture Image Understanding introduces a specialized MLLM and instruction-tuning dataset for agriculture, available at https://github.com/boudiafA/AgriChat.
- SurgΣ-DB: A large-scale multimodal data foundation for surgical intelligence by NUS, CUHK, SJTU, and NVIDIA, presented in SurgΣ: A Spectrum of Large-Scale Multimodal Data and Foundation Models for Surgical Intelligence. No public code.
- MLLM-based Textual Explanations for Face Comparison: University of Technology, Sydney’s work on MLLM-based explanations for face recognition, code at https://github.com/redwankarimsony/LR-MLLMFR-Explainability.
- HyDRA Framework: For Open-Vocabulary Multimodal Emotion Recognition, proposed by University of Chinese Academy of Sciences in Follow the Clues, Frame the Truth: Hybrid-evidential Deductive Reasoning in Open-Vocabulary Multimodal Emotion Recognition.
- GAP-MLLM: A geometry-aligned pre-training paradigm for 3D spatial perception in MLLMs by J. Zhang et al., as described in GAP-MLLM: Geometry-Aligned Pre-training for Activating 3D Spatial Perception in Multimodal Large Language Models. No public code.
- InViC Framework: From Alibaba Group and Tsinghua University, InViC: Intent-aware Visual Cues for Medical Visual Question Answering enhances Med-VQA, with code at https://github.com/alibaba-damo-academy/MedEvalKit.
- NeSy-Route Benchmark: A neuro-symbolic benchmark for constrained route planning in remote sensing, presented in NeSy-Route: A Neuro-Symbolic Benchmark for Constrained Route Planning in Remote Sensing. No public code.
- VisBrowse-Bench: For benchmarking visual-native search for multimodal browsing agents, with code at https://github.com/ZhengboZhang/VisBrowse-Bench.
- FrameRepeat Framework: From Tsinghua University and Alibaba Group, When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition enhances video reasoning, with code at https://github.com/Qwen/QwenVL.
- 360Bench & Free360: Tohoku University and RIKEN AIP’s 360° Image Perception with MLLMs: A Comprehensive Benchmark and a Training-Free Method introduces a benchmark and a training-free framework for 360° image VQA. Code provided in the paper.
- ChartIR: A training-free framework for chart-to-code generation; code is provided in the paper at https://arxiv.org/abs/2506.14837.
- SophiaVL-R1: Reinforces MLLM reasoning with thinking reward, code at https://github.com/kxfan2002/SophiaVL-R1.
- EscapeCraft-4D Environment: For evaluating time awareness and cross-modal active perception in large models, with code at https://github.com/THUNLP-MT/EscapeCraft-4D.
- GUI-CEval: A hierarchical Chinese benchmark for mobile GUI agents, as described in GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents. No public code.
- CAMD: A decoding method for efficient multimodal reasoning, with code at https://github.com/THU-IR/CAMD.
- MVX-Bench & SAMA: University of Texas at Dallas’s A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding introduces a benchmark for multi-video understanding and an agentic framework. Dataset available at https://huggingface.co/datasets/MVX-bench/MVX-Bench.
- ViDscribe Platform: For customizing audio descriptions and VQA in online videos, as presented in ViDscribe: Multimodal AI for Customizing Audio Description and Question Answering in Online Videos. No public code.
- GroundSet Dataset: A cadastral-grounded dataset for spatial understanding with vector data in Earth Observation, described in GroundSet: A Cadastral-Grounded Dataset for Spatial Understanding with Vector Data. No public code.
- Fine-tuning MLLMs Without Forgetting Is Easier Than You Think: Stanford University and Tsinghua University propose strategies to mitigate catastrophic forgetting, with code at https://github.com/lihe50hz/MLLM-Forgetting.
- ES-Merging: From KAIST, ES-Merging: Biological MLLM Merging via Embedding Space Signals is an innovative method for combining specialized MLLMs. No public code.
- UAVBench & UAVIT-1M: Northwestern Polytechnical University introduces a benchmark and dataset for low-altitude UAV vision-language understanding, code at https://UAVBench.github.io/.
- ECG-Reasoning-Benchmark: A multi-turn evaluation framework for clinical reasoning in ECG interpretation, with code at https://github.com/Jwoo5/ecg-reasoning-benchmark.
- VGMED Dataset & VGRefine: Singapore University of Technology and Design proposes a dataset and method to improve visual grounding in medical MLLMs, code at https://guimeng-leo-liu.github.io/Medical-MLLMs-Fail/.
- MMOU Benchmark: From NVIDIA and University of Maryland, MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos targets audio-visual reasoning in long videos. Dataset at https://huggingface.co/datasets/nvidia/MMOU.
- Visual Privacy Preservation Framework: For black-box MLLMs, proposed in When Visual Privacy Protection Meets Multimodal Large Language Models. No public code.
- EviAgent: An evidence-driven agent for radiology report generation by Sun Yat-sen University, code provided in the paper at https://arxiv.org/pdf/2603.13956.
- SpatialMed Benchmark: From Aalto University and Carnegie Mellon University, Beyond Medical Diagnostics: How Medical Multimodal Large Language Models Think in Space is the first comprehensive benchmark for evaluating MLLMs’ spatial intelligence in medical imaging. No public code.
- AD-Copilot: A vision-language assistant for industrial anomaly detection by Jiao Tong University and A*STAR, code at https://github.com/jam-cc/.
- VAEX-Bench: For evaluating extractive and abstractive spatiotemporal reasoning in egocentric videos, as presented in Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence. No public code.
- Dyn-Bench & ST-TCM: From XMU and THU, Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World introduces a benchmark for dynamic understanding and a framework for spatio-temporal reasoning. Code at https://dyn-bench.github.io/.
- EGOPOINTVQA & HINT: Imperial College London and Huawei Noah’s Ark Lab introduce a dataset for gesture-based egocentric video question answering and a method for encoding hand gestures, code at https://yuuraa.github.io/papers/choi2026egovqa.
- SPARROW: A video MLLM that improves referential stability and temporal coherence, with code at https://github.com/RISys-Lab/SPARROW.
- SvfEye: A training-free semantic-visual fusion framework for fine-grained reasoning, with code at https://github.com/Xiaoxiang100/SvfEye-final.
- EndoCoT: A diffusion framework for chain-of-thought reasoning, with code at https://github.com/InternLM/EndoCoT.
- ForensicZip: A training-free inference acceleration framework for forensic VLMs, with code at https://github.com/laiyingxin2/ForensicZip.
- LatentGeo & GeoAux: From HKUST(GZ), LatentGeo: Learnable Auxiliary Constructions in Latent Space for Multimodal Geometric Reasoning introduces a framework and benchmark for geometric reasoning using latent tokens. Code and benchmark at https://github.com/Ethylyikes/LatentGeo.
- EgoIntent Benchmark: From Google DeepMind, EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next targets step-level intent understanding in egocentric videos. No public code.
- Hoi3DGen: A framework for generating high-quality 3D human-object interactions from text, code at https://github.com/black-forest-labs/flux and https://github.com/3DTopia/OpenLRM.
- EvoTok: A unified image tokenizer via residual latent evolution, with code at https://github.com/VisionXLab/EvoTok.
- Think While Watching Framework: From Chinese Academy of Sciences and BAAI, Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models enables multi-turn video reasoning over streaming data. No public code.
- ZeroSense Benchmark: For evaluating visual-text compression methods, with code at https://github.com/MedHK23/ZeroSense.
- Explicit Logic Channel (ELC): From A*STAR, Singapore, Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks validates and enhances MLLMs on zero-shot tasks without ground-truth annotations. No public code.
- MT-RL-Judge: A multi-task reinforcement learning framework for MLLM evaluation, with code at https://github.com/hiyouga/EasyR1.
- MaterialFigBENCH: From AIST, Japan, MaterialFigBENCH: benchmark dataset with figures for evaluating college-level materials science problem-solving abilities of multimodal large language models evaluates MLLMs on college-level materials science problems that involve scientific figures. Code and dataset at https://huggingface.co/omron-sinicx.
- DRIVEXQA Dataset & MVX-LLM: From TU Darmstadt and Tsinghua University, DriveXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding is a new dataset and architecture for cross-modal VQA in adverse driving scenarios. No public code.
- ReasonMap Benchmark: From Westlake University and National University of Singapore, ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps targets fine-grained visual reasoning over transit maps. No public code.
- OrchMLLM Framework: From ByteDance Seed and Peking University, OrchMLLM: Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training accelerates MLLM training. No public code.
- DIPE (Inter-Modal Distance Invariant Position Encoding): From Tencent Hunyuan Team, Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding addresses visual fading, with code at https://github.com/lchen1019/DIPE.
- GeoSense Framework: From University of Science and Technology of China, GeoSense: Internalizing Geometric Necessity Perception for Multimodal Reasoning enables MLLMs to autonomously decide when to use geometric information. Code at https://water-wood-rain.github.io/Geosense/.
- PathMem Framework: From University of Science and Technology, PathMem: Toward Cognition-Aligned Memory Transformation for Pathology MLLMs is a memory-centric framework for pathology MLLMs. No public code.
- FetalAgents: A multi-agent system for fetal ultrasound analysis, with code at https://github.com/huang-jw22/FetalAgents.
- EXPLORE-Bench: For egocentric scene prediction with long-horizon reasoning, as described in EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning. Code from multiple sources.
- OddGridBench & OddGrid-GRPO: From Shenzhen University, OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models evaluates and improves fine-grained visual discrepancy sensitivity. Code at https://wwwtttjjj.github.io/OddGridBench/.
- TubeMLLM & TubeMData: From Tsinghua University, TubeMLLM: A Foundation Model for Topology Knowledge Exploration in Vessel-like Anatomy is a foundation model and benchmark dataset for topology-aware medical imaging. No public code.
- Reading, Not Thinking: This paper (Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs) identifies the modality gap when text becomes pixels in MLLMs. No public code.
- MEGC2026 Challenge: For micro-expression detection with VQA, detailed at https://megc2026.github.io. No public code.
- HMR-1 & MedMassage-12K: From Chinese Academy of Sciences, HMR-1: Hierarchical Massage Robot with Vision-Language-Model for Embodied Healthcare introduces a hierarchical framework for embodied healthcare robotics along with the MedMassage-12K dataset. A code repository is available for HMR-1.
- Granulon: From National University of Singapore and Nanyang Technological University, Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM enhances visual encoders with adaptive granularity control. Code at https://github.com/Nanyang-TECH/Granulon.
- Daily-Omni Benchmark: From Fudan University, Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities targets audio-visual reasoning with temporal alignment across modalities. Code at https://github.com/Lliar-liar/Daily-Omni.
Impact & The Road Ahead:
These advancements herald a new era for MLLMs, pushing them toward more intelligent, reliable, and context-aware capabilities. The focus on 3D spatial understanding, particularly with methods like VEGA-3D and Motion-MLLM, is critical for embodied AI, robotics, and augmented reality, enabling machines to interact with our physical world with unprecedented accuracy. Innovations in reasoning, exemplified by SophiaVL-R1 and the VRGA framework, promise MLLMs that can not only generate fluent responses but also provide genuinely logical and verifiable explanations. This is crucial for applications demanding high trust, such as medical diagnostics and autonomous systems.
The emphasis on safety and robustness, as seen in CoDA for medical MLLMs and OOD-MMSafe for broader causal projection, is paramount for responsible AI deployment. Furthermore, domain-specific models like AgriChat and FetalAgents showcase the immense potential of MLLMs to revolutionize industries from agriculture to healthcare, tailoring general-purpose intelligence to specialized tasks. Benchmarks like MMOU, GeoAux-Bench, and OddGridBench are crucial for systematically identifying current limitations and guiding future research. As MLLMs continue to evolve, the integration of diverse modalities, enhanced reasoning, and a strong commitment to safety will pave the way for a new generation of AI assistants that are truly perceptive, reliable, and transformative.