Multimodal Large Language Models: A Leap Towards Unified, Intelligent Understanding
Latest 50 papers on multimodal large language models: Sep. 29, 2025
Multimodal Large Language Models (MLLMs) are revolutionizing how AI interacts with and interprets the world, moving beyond text to encompass vision, audio, and even 3D environments. This rapidly evolving field is pushing the boundaries of what’s possible, tackling challenges from nuanced human-like reasoning to robust real-world deployment. Recent research showcases incredible strides in integrating diverse modalities, enhancing model efficiency, and fortifying safety—all while pushing towards a more unified and intelligent AI.
The Big Idea(s) & Core Innovations
At the heart of these advancements lies a common ambition: to enable MLLMs to perceive, reason, and act with human-like proficiency across multiple data types. A central theme is the development of unified frameworks that seamlessly blend different modalities. For instance, OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment from Tsinghua University introduces a framework for latent space alignment that unifies understanding, generation, and retrieval tasks. Similarly, MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer by Apple tackles the challenge of integrating vision understanding and image generation within a single model, using a hybrid tokenizer to balance continuous embeddings for understanding and discrete tokens for generation.
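As a concrete illustration of the hybrid-tokenizer idea, the sketch below shows a shared vision backbone feeding both a continuous branch (embeddings consumed by the LLM for understanding) and a vector-quantized discrete branch (token ids usable for generation). The module names, dimensions, and nearest-codebook lookup are simplifying assumptions for illustration, not Apple's actual implementation.

```python
import torch
import torch.nn as nn

class HybridVisionTokenizer(nn.Module):
    """Toy hybrid tokenizer: one backbone, two output branches (hypothetical names)."""
    def __init__(self, patch_dim=768, llm_dim=1024, codebook_size=8192):
        super().__init__()
        self.backbone = nn.Linear(patch_dim, llm_dim)          # stand-in for a ViT encoder
        self.understand_proj = nn.Linear(llm_dim, llm_dim)     # continuous branch (understanding)
        self.codebook = nn.Embedding(codebook_size, llm_dim)   # discrete branch (generation)

    def forward(self, patch_feats):                            # (batch, num_patches, patch_dim)
        h = self.backbone(patch_feats)
        continuous = self.understand_proj(h)                   # continuous embeddings for the LLM
        # Nearest-codebook lookup yields discrete token ids for an autoregressive image decoder.
        dists = torch.cdist(h, self.codebook.weight.unsqueeze(0).expand(h.size(0), -1, -1))
        discrete_ids = dists.argmin(dim=-1)                    # (batch, num_patches)
        return continuous, discrete_ids

tok = HybridVisionTokenizer()
cont, ids = tok(torch.randn(2, 16, 768))
print(cont.shape, ids.shape)  # torch.Size([2, 16, 1024]) torch.Size([2, 16])
```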
Another significant area of innovation is enhancing reasoning capabilities through targeted training and novel architectural components. MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning, from HKUST (Guangzhou), HKUST, and HIT, proposes a reinforcement learning framework with process reasoning rewards to boost video temporal understanding. Expanding on this, VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception, by researchers including those at Zhejiang University, introduces Visual Test-Time Scaling (VTTS) to enhance MLLMs through iterative visual perception, mimicking human hierarchical attention. For geometric reasoning, GeoRef: Referring Expressions in Geometry via Task Formulation, Synthetic Supervision, and Reinforced MLLM-based Solutions, from authors associated with LLAVA-VL GitHub and OpenAI, leverages synthetic supervision and reinforcement learning to improve MLLM performance on complex geometry tasks. In the medical domain, LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition? from East China Normal University presents a framework and tailored training strategies to improve zero-shot disease recognition from radiology images.
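To make the process-reward idea concrete, here is a minimal sketch of a reward that mixes final-answer correctness with a score over intermediate reasoning steps, so the policy is rewarded for how it reasons about temporal order rather than only for the answer. The weighting, the step scorer, and the function names are illustrative assumptions, not MOSS-ChatV's actual recipe.

```python
from typing import Callable, List

def process_reasoning_reward(
    steps: List[str],                     # model's intermediate reasoning steps
    final_answer: str,
    reference_answer: str,
    step_scorer: Callable[[str], float],  # e.g. an MLLM judge returning a score in [0, 1]
    alpha: float = 0.5,                   # trade-off between process and outcome terms
) -> float:
    """Reward = alpha * mean step score + (1 - alpha) * exact-match outcome (toy version)."""
    outcome = 1.0 if final_answer.strip().lower() == reference_answer.strip().lower() else 0.0
    process = sum(step_scorer(s) for s in steps) / max(len(steps), 1)
    return alpha * process + (1.0 - alpha) * outcome

# Toy usage: a keyword heuristic stands in for a learned judge of temporal reasoning.
toy_scorer = lambda step: 1.0 if ("before" in step or "after" in step) else 0.0
reward = process_reasoning_reward(
    steps=["The door opens before the person enters.", "Then the light turns on."],
    final_answer="the light turns on",
    reference_answer="The light turns on",
    step_scorer=toy_scorer,
)
print(reward)  # 0.75
```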
Efficiency and robustness are also key drivers. Sparse Training Scheme for Multimodal LLM, from Peking University and the University of Illinois Urbana-Champaign, introduces a Sparse Training Scheme (STS) with a Visual Token Compressor and Layer Dynamic Skipper to significantly reduce training overhead. In a similar vein, MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe by OpenBMB details improvements in architecture, data strategy, and training methods to create powerful MLLMs at reduced computational cost. Addressing real-world deployment challenges, Adaptive Guidance Semantically Enhanced via Multimodal LLM for Edge-Cloud Object Detection from the Institute of Computing Technology, Chinese Academy of Sciences offers an adaptive guidance framework for efficient edge-cloud collaborative object detection. Furthermore, MoA-Off: Adaptive Heterogeneous Modality-Aware Offloading with Edge-Cloud Collaboration for Efficient Multimodal LLM Inference, also from the Institute of Computing Technology, Chinese Academy of Sciences, proposes dynamic workload scheduling based on modality-specific complexity to reduce latency and resource overhead in MLLM inference.
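The offloading idea can be sketched as a simple scheduler that estimates a per-modality cost for each request and routes it to the edge when the estimate fits a local budget, otherwise to the cloud. The cost weights, budget, and class names below are hypothetical placeholders, not MoA-Off's actual policy.

```python
from dataclasses import dataclass

@dataclass
class Request:
    text_tokens: int = 0
    image_pixels: int = 0        # total pixels across attached images
    audio_seconds: float = 0.0

def modality_cost(req: Request) -> float:
    """Hypothetical per-modality weights standing in for profiled compute cost."""
    return 0.001 * req.text_tokens + 2e-6 * req.image_pixels + 0.05 * req.audio_seconds

def route(req: Request, edge_budget: float = 2.0) -> str:
    """Send the request to the edge if its estimated cost fits the edge budget, else the cloud."""
    return "edge" if modality_cost(req) <= edge_budget else "cloud"

print(route(Request(text_tokens=500)))                          # edge  (cost 0.5)
print(route(Request(text_tokens=500, image_pixels=4_000_000)))  # cloud (cost 8.5)
```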
Safety, trustworthiness, and ethical considerations are increasingly paramount. SUA: Stealthy Multimodal Large Language Model Unlearning Attack, from The Pennsylvania State University and Amazon, exposes vulnerabilities in MLLM unlearning processes, showing how forgotten knowledge can be recovered via adversarial perturbations. Complementary to this, SafeEraser: Enhancing Safety in Multimodal Large Language Models through Multimodal Machine Unlearning by The Hong Kong University of Science and Technology (Guangzhou) introduces a benchmark and novel techniques such as a Prompt Decouple Loss to enhance safety unlearning without over-forgetting. Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models, by ByteDance and CASIA, addresses the critical problem of OCR hallucinations when MLLMs process degraded documents, introducing a new benchmark and a GRPO-based framework. Finally, Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM, from Lehigh University and The Chinese University of Hong Kong, Shenzhen, provides a critical look at data contamination across MLLMs, highlighting its prevalence and its impact on benchmarks.
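The unlearning-attack finding can be illustrated with a generic PGD-style loop that optimizes a small, bounded perturbation of the visual input to raise the likelihood of the supposedly forgotten target, revealing knowledge that was suppressed rather than removed. The model, loss, and hyperparameters below are placeholders, not SUA's actual setup.

```python
import torch

def recover_via_perturbation(model, image, target_loss_fn, eps=8 / 255, steps=20, lr=0.01):
    """PGD-style loop: optimize a bounded perturbation that minimizes the target's loss."""
    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        loss = target_loss_fn(model, (image + delta).clamp(0, 1))
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)   # keep the perturbation small and hard to notice
    return (image + delta).detach().clamp(0, 1)

# Toy usage: a linear "model" and a loss that pushes its output toward a fixed target score,
# standing in for the negative log-likelihood of the forgotten response.
toy_model = torch.nn.Linear(3 * 8 * 8, 1)
toy_loss = lambda m, x: (m(x.flatten(1)) - 5.0).pow(2).mean()
adv_image = recover_via_perturbation(toy_model, torch.rand(1, 3, 8, 8), toy_loss)
print(adv_image.shape)  # torch.Size([1, 3, 8, 8])
```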
Under the Hood: Models, Datasets, & Benchmarks
This collection of papers highlights the critical role of specialized models, expansive datasets, and rigorous benchmarks in driving MLLM progress. Here’s a snapshot of key resources:
- Models & Frameworks:
- MOSS-ChatV (MOSS-ChatV training pipeline): A reinforcement learning framework for video temporal reasoning.
- VideoChat-R1.5 (https://github.com/OpenGVLab/VideoChat-R1): Enhances MLLMs through iterative visual perception.
- GeoRef (https://arxiv.org/pdf/2509.21050): A framework for referring expressions in geometry using synthetic supervision and RL.
- SupCLAP (https://arxiv.org/pdf/2509.21033): Uses Support Vector Regularization for stable audio-text contrastive learning.
- FORCE (https://arxiv.org/pdf/2509.21029): Improves transferability of visual jailbreaking attacks via feature over-reliance correction.
- LLaVA-RadZ (https://github.com/EastChinaNormalUniversity/LLaVA-RadZ): MLLM framework for zero-shot medical disease recognition.
- Adaptive Guidance Semantically Enhanced Framework (https://arxiv.org/pdf/2509.19875): For edge-cloud object detection with MLLMs.
- OmniBridge (https://github.com/xiao-xt/OmniBridge): A unified framework for multimodal understanding, generation, and retrieval.
- PhotoEye (https://github.com/daiqing98/The-Photographers-Eye): MLLM trained for aesthetic visual understanding and photography critique.
- Qianfan-VL (https://github.com/baidubce/Qianfan-VL): Domain-enhanced vision-language models for OCR, document, and mathematical reasoning.
- Baseer (https://arxiv.org/pdf/2509.18174): A vision-language model fine-tuned for Arabic document OCR.
- MiniCPM-V 4.5 (https://github.com/OpenBMB/MiniCPM-V): An efficient and powerful MLLM with a unified 3D-Resampler.
- Sparse Training Scheme (STS) (https://arxiv.org/pdf/2509.18150): A training-efficient framework for MLLMs with Visual Token Compressor and Layer Dynamic Skipper.
- TempSamp-R1 (https://github.com/HVision-NKU/TempSamp-R1): A reinforcement fine-tuning framework for temporal video understanding.
- LLaVA-AV-SSM (https://github.com/naver-ai/LLaVA-AV-SSM): An audio-visual Video-LLM baseline.
- WISE (https://github.com/yiwenJG/WISE-MCoT): Enhances MLLM interpretability via weak-supervision-guided step-by-step explanations.
- MLLM-Driven Semantic Identifier Generation (https://arxiv.org/pdf/2509.17359): Leverages LLMs to generate semantic identifiers for cross-modal retrieval.
- MoA-Off (https://arxiv.org/pdf/2509.16995): Adaptive heterogeneous modality-aware offloading for efficient MLLM inference.
- Interpretable Audio Editing Evaluation (https://github.com/NKU-HLT/Eval Reasoning): Framework for automated audio editing evaluation using MLLMs and Chain-of-Thought reasoning.
- SD-RPN (https://github.com/YuHengsss/SD-RPN): Self-Distilled RoI Predictors for fine-grained MLLM perception.
- Text-Scene (https://arxiv.org/pdf/2509.16721): A scene-to-language parsing framework for 3D scene understanding.
- FESTA (https://github.com/iiscleap/mllm-uncertainty-estimation): Novel method for uncertainty estimation in MLLMs via functionally equivalent and complementary sampling.
- VAT-KG (https://huggingface.co/vatkg/VATKG_CODE): A multimodal knowledge graph dataset for retrieval-augmented generation.
- 3D MLLMs for CT Report Generation (https://github.com/bowang-lab/AMOS-MM-Solution): Decoupled architecture design for radiology report generation.
- KIE-HVQA Framework (https://github.com/hiyouga/EasyR1): GRPO-based framework to mitigate OCR hallucinations.
- SUA (https://github.com/Zood123/MLLM-Unlearning-Attack): Stealthy multimodal LLM unlearning attack framework.
- SafeEraser (https://github.com/yuu250/SafeEraser): Enhances safety in MLLMs through multimodal machine unlearning.
- MM-DETECT (https://github.com/MLLM-Data-Contamination/MM-Detect): An analytical tool for detecting multimodal data contamination.
- MANZANO (https://arxiv.org/pdf/2509.16197): A simple and scalable unified multimodal model with a hybrid vision tokenizer.
- Sycophantic Reflective Tuning (SRT) (https://arxiv.org/pdf/2509.16149): Mitigates visual sycophantic behavior in MLLMs.
- BaseReward (https://arxiv.org/pdf/2509.16127): A strong baseline for Multimodal Reward Models.
- SEE&TREK: Training-free spatial prompting for MLLMs.
- EmoQ (https://arxiv.org/pdf/2509.15775): Speech Emotion Recognition via Speech-Aware Q-Former and Large Language Model.
- BTL-UI (https://github.com/xiaomi-research/btl-ui): Blink-Think-Link Reasoning Model for GUI Agent.
- Beyond Spurious Signals (https://github.com/Zichen-Wu/Multimodal-Mixture-of-Expert-Debiasing): Debiasing MLLMs via counterfactual inference and adaptive expert routing.
- Perception-R1 (https://github.com/tongxiao2002/Perception-R1): Enhances multimodal reasoning via visual perception reward.
- OSPO (https://github.com/korea-university/OSPO): Object-centric self-improving preference optimization for text-to-image generation.
- ReasonPlan (https://github.com/Liuxueyi/ReasonPlan): Unified scene prediction and decision reasoning for autonomous driving.
- FC-Attack (https://github.com/ZZYHKUSTGZ/FC_Attack): Jailbreaking MLLMs via auto-generated flowcharts.
- Datasets & Benchmarks:
- MOSS-Video (https://arxiv.org/abs/2502.13923): Large-scale video state prediction with reasoning annotations.
- VTTS-80K (https://github.com/OpenGVLab/VideoChat-R1): For iterative perception and multimodal reasoning.
- PhotoCritique & PhotoBench (https://github.com/daiqing98/The-Photographers-Eye): For aesthetic visual understanding.
- Misraj-DocOCR (https://huggingface.co/datasets/Misraj/Misraj-DocOCR): High-quality benchmark for Arabic OCR evaluation.
- AVQA-Hard & Music-AVQA-Hard (https://arxiv.org/pdf/2509.17901): To evaluate audio-visual understanding in Video-LLMs.
- InPlan3D (https://arxiv.org/pdf/2509.16721): Comprehensive benchmark for embodied task planning in 3D environments.
- NUMINA (https://github.com/fengshun124/NUMINA): Benchmark for multi-dimensional intelligence and numerical reasoning.
- MOMENTS (https://github.com/villacu/MoMentS): Comprehensive multimodal benchmark for Theory of Mind.
- VAT-KG (https://huggingface.co/datasets/vatkg/VATKG_DATASET): Knowledge-intensive multimodal knowledge graph.
- KIE-HVQA (https://huggingface.co/datasets/bytedance-research/KIE-HVQA): Benchmark for OCR hallucinations in degraded document understanding.
- SAFEERASER (https://arxiv.org/pdf/2502.12520): Benchmark for safety unlearning in MLLMs.
- SRT-30K (https://arxiv.org/pdf/2509.16149): Dataset for training MLLMs in developing reflective capabilities.
- TennisTV (https://arxiv.org/pdf/2509.15602): Benchmark for tennis rally understanding.
- GeoReasoning-10K (https://arxiv.org/pdf/2509.15217): Dataset for geometric image caption synthesis.
Impact & The Road Ahead
The impact of these advancements resonates across various domains, from enhancing autonomous driving (ReasonPlan) and medical diagnostics (LLaVA-RadZ and 3D MLLMs for CT report generation) to transforming content moderation (M-PACE: Mother Child Framework for Multimodal Compliance by Sprinklr AI) and improving recommendation systems (Serendipitous Recommendation with Multimodal LLM by Google DeepMind and YouTube). The focus on efficiency (e.g., MiniCPM-V 4.5 and the Sparse Training Scheme) makes advanced MLLMs more accessible for real-world deployment, especially in edge-cloud environments.
However, significant challenges remain. The sycophantic modality gap identified in Pointing to a Llama and Call it a Camel by HKUST and the pervasive data contamination discussed in Both Text and Images Leaked! highlight the need for more robust training, evaluation, and unlearning mechanisms. Benchmarks like NUMINA and MOMENTS reveal that current MLLMs still struggle with fine-grained numerical reasoning and complex social intelligence, often relying too heavily on textual cues over richer visual and audio information.
The future of MLLMs promises a unified AI capable of truly understanding our complex, multimodal world. As researchers continue to refine architectures, construct richer datasets, and develop more robust safety protocols, we are moving closer to intelligent systems that can perceive, reason, and interact with unprecedented sophistication. The journey is long, but the breakthroughs highlighted here are clear indicators of an exciting, transformative path forward.