Unlocking New Horizons: Recent Breakthroughs in Foundation Models Across Vision, Language, and Science
Latest 100 papers on foundation models: Mar. 14, 2026
The landscape of AI/ML is being continually reshaped by the rapid evolution of foundation models. These powerful, pre-trained behemoths are proving to be invaluable general-purpose tools, capable of handling a stunning array of tasks with minimal task-specific fine-tuning. However, their sheer scale and complexity also present unique challenges, from ensuring fair and unbiased behavior to achieving efficient deployment in resource-constrained environments. Recent research has been pushing the boundaries, addressing these critical aspects and extending the reach of foundation models into exciting new domains.
The Big Idea(s) & Core Innovations
The overarching theme in recent foundation model research is a dual pursuit: enhancing versatility and interpretability while simultaneously tackling practical limitations like efficiency and bias. We’re seeing models become more ‘aware’ of their context, whether it’s the physical world, temporal dynamics, or even their own internal workings.
In the realm of multimodal understanding and interaction, significant strides are being made. Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion by Lijiang Li et al. from Nanjing University pioneers a shift from autoregressive to diffusion-based architectures for any-to-any multimodal language models, promising more flexible and efficient processing. Complementing this, Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities by Ziwei Zhou et al. from Fudan University introduces a benchmark highlighting the critical need for robust cross-modal temporal alignment for deep audio-visual understanding. For tangible robotic interaction, TiPToP: A Modular Open-Vocabulary Planning System for Robotic Manipulation by Leslie Pack Kaelbling and Tomás Lozano-Pérez from MIT and UC Berkeley enables robots to interpret and execute complex tasks from natural language, while SELF-VLA: A Skill Enhanced Agentic Vision-Language-Action Framework for Contact-Rich Disassembly by Zhang, Chen et al. (various affiliations) and APPLV: Adaptive Planner Parameter Learning from Vision-Language-Action Model by Y. Xu et al. empower robots to adaptively plan and manipulate in contact-rich and dynamic environments. Further enhancing robotic perception, OmniGuide: Universal Guidance Fields for Enhancing Generalist Robot Policies by Yi Zhang et al. (UC Berkeley, Stanford) improves VLA models by integrating diverse guidance sources, while Safe-Night VLA: Seeing the Unseen via Thermal-Perceptive Vision-Language-Action Models for Safety-Critical Manipulation by Zitkovich et al. (NVIDIA, MIT CSAIL) introduces thermal perception for robust, safety-critical manipulation in challenging conditions.
Computer vision continues to leverage foundation models for enhanced perception and understanding. OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams by Xiaohui Shen et al. from Carnegie Mellon University introduces a unified streaming visual backbone capable of diverse tasks like perception, reconstruction, and action without fine-tuning, leveraging causal spatiotemporal attention and 3D-RoPE. In 3D vision, DVD: Deterministic Video Depth Estimation with Generative Priors by Harold Haodong Chen et al. (EnVision-Research, Google Research) combines generative and discriminative strengths for high-fidelity video depth estimation, while Shape-of-You: Fused Gromov-Wasserstein Optimal Transport for Semantic Correspondence in-the-Wild by Jiin Im et al. from Hanyang University uses 3D geometric structure for globally consistent semantic matching. X-GS: An Extensible Open Framework Unifying 3DGS Architectures with Downstream Multimodal Models by Yueen Ma and Irwin King from The Chinese University of Hong Kong unifies 3D Gaussian Splatting with multimodal models for real-time semantic SLAM and language-driven tasks. For resource-efficient 3D understanding, Pointy – A Lightweight Transformer for Point Cloud Foundation Models by Konrad Szafer et al. (Poznan University of Technology) demonstrates that smaller, well-designed models can outperform larger ones with less data. EventVGGT: Exploring Cross-Modal Distillation for Consistent Event-based Depth Estimation by Yinrui Ren et al. (HKUST(GZ), CUHK) leverages cross-modal distillation from VFMs to achieve temporally consistent event-based depth estimation in challenging conditions. Lastly, VG3S: Visual Geometry Grounded Gaussian Splatting for Semantic Occupancy Prediction by Zhiyuan Li et al. from National University of Singapore integrates visual geometry with Gaussian splatting for more accurate 3D scene understanding.
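Optimal-transport matching of the kind Shape-of-You builds on can be pictured with a plain entropic Sinkhorn solver. The sketch below is illustrative only: it solves standard entropic OT between two toy descriptor sets, not the paper's fused Gromov-Wasserstein formulation, and every name in it is made up for this example.

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropic OT: soft matching between two sets of features."""
    n, m = cost.shape
    K = np.exp(-cost / reg)                 # Gibbs kernel
    a, b = np.ones(n) / n, np.ones(m) / m   # uniform marginals
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]      # transport plan

# toy example: match 4 source descriptors to 5 target descriptors
rng = np.random.default_rng(0)
src, tgt = rng.normal(size=(4, 8)), rng.normal(size=(5, 8))
cost = 1.0 - (src @ tgt.T) / (
    np.linalg.norm(src, axis=1, keepdims=True) * np.linalg.norm(tgt, axis=1))
plan = sinkhorn(cost)
print(plan.sum())   # the plan is a joint probability matrix
```

The resulting plan gives a soft correspondence between the two sets; Shape-of-You additionally penalizes intra-set geometric distortion, which is what the Gromov-Wasserstein term adds on top of this plain matching.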
Addressing critical issues of bias and interpretability, Locating Demographic Bias at the Attention-Head Level in CLIP’s Vision Encoder by Shi, Gandelsman et al. (Google Research, Stanford University) reveals that demographic bias in CLIP’s vision encoder is localized to specific attention heads, which can be identified and analyzed. For trustworthiness, RandMark: On Random Watermarking of Visual Foundation Models by Anna Chistyakova and Mikhail Pautov introduces a robust watermarking method for visual foundation models, ensuring ownership verification even after fine-tuning and pruning. In medical imaging, the impact of human input is highlighted in Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation by Caroline Magga et al. (University of Amsterdam), showing that human prompts significantly affect performance.
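The head-level analysis behind the CLIP bias paper can be pictured with a toy ablation loop: zero out one attention head at a time and measure how much the output (or a downstream bias-probe score) shifts. This is an illustrative numpy sketch with random toy weights, not CLIP's actual implementation or the paper's method.

```python
import numpy as np

def attention_output(x, Wq, Wk, Wv, ablate_head=None):
    """Toy multi-head attention; optionally zero one head's contribution."""
    n_heads, d, dh = Wq.shape
    heads = []
    for h in range(n_heads):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        attn = np.exp(q @ k.T / np.sqrt(dh))
        attn /= attn.sum(axis=-1, keepdims=True)   # softmax over keys
        out = attn @ v
        if h == ablate_head:
            out = np.zeros_like(out)               # knock this head out
        heads.append(out)
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 16))                       # 6 tokens, dim 16
Wq, Wk, Wv = (rng.normal(size=(4, 16, 4)) for _ in range(3))
base = attention_output(x, Wq, Wk, Wv)
# effect of removing head 2 on the representation
delta = np.linalg.norm(base - attention_output(x, Wq, Wk, Wv, ablate_head=2))
print(delta > 0)
```

Running such an ablation per head, but scoring a demographic-bias metric instead of a raw norm, is the intuition behind localizing bias to specific heads.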
Time series analysis and causal inference are also seeing transformative applications. TimeSqueeze: Dynamic Patching for Efficient Time Series Forecasting by Sravan Kumar Ankireddy et al. (University of Texas at Austin) optimizes forecasting efficiency by adaptively selecting patch boundaries based on local signal complexity. GTM: A General Time-series Model for Enhanced Representation Learning of Time-Series Data by Cheng He et al. (University of Science and Technology of China) introduces a frequency-domain attention mechanism for improved time-series representation. For causal insights, Frequentist Consistency of Prior-Data Fitted Networks for Causal Inference by Valentyn Melnychuk et al. (LMU Munich) proposes a one-step posterior correction method to address prior-induced confounding bias in PFNs. Building on this, Interventional Time Series Priors for Causal Foundation Models by Dennis Thumm and Ying Chen from National University of Singapore introduces CausalTimePrior, a framework for generating synthetic temporal structural causal models for training causal foundation models. Further pushing time series analysis, Dissecting Chronos: Sparse Autoencoders Reveal Causal Feature Hierarchies in Time Series Foundation Models by Anurag Mishra from Rochester Institute of Technology uses sparse autoencoders to reveal depth-dependent causal feature hierarchies in Chronos-T5, showing that mid-encoder layers are most critical for forecasting. In terms of data quality, Rating Quality of Diverse Time Series Data by Meta-learning from LLM Judgment by Shunyu Wu et al. (Sun Yat-sen University) leverages LLMs and meta-learning to assess the quality of diverse time series data, providing a generalizable rating model. For robust time series applications, Retrieval-Augmented Generation with Covariate Time Series by Kenny Ye Liang et al. (Tsinghua University) introduces RAG4CTS, a regime-aware RAG framework for industrial time series, integrating physics-informed retrieval for predictive maintenance. Lastly, Impermanent: A Live Benchmark for Temporal Generalization in Time Series Forecasting by Azul Garza et al. (TimeCopilot, University of Oxford) provides a live benchmark for evaluating temporal generalization in time series forecasting, using sequentially updated data streams to reflect real-world dynamics.
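TimeSqueeze's idea of complexity-adaptive patching can be sketched simply: keep fine patches where the signal varies and merge flat regions into coarse ones. The sketch below uses naive variance thresholding as a stand-in for the paper's learned mechanism; the function name and thresholds are illustrative.

```python
import numpy as np

def dynamic_patches(series, base=8, var_threshold=0.05):
    """Split a series into patches; grow patches over low-variance regions."""
    patches, start = [], 0
    while start < len(series):
        end = min(start + base, len(series))
        # keep extending the patch while the local signal stays nearly flat
        while end < len(series) and np.var(series[start:end + base]) < var_threshold:
            end = min(end + base, len(series))
        patches.append((start, end))
        start = end
    return patches

t = np.linspace(0, 10, 256)
signal = np.where(t < 5, 0.01 * t, np.sin(4 * t))   # flat half, wiggly half
patches = dynamic_patches(signal)
lengths = [e - s for s, e in patches]
print(len(patches), max(lengths), min(lengths))
```

The flat half collapses into a few long patches while the oscillating half stays finely patched, which is the efficiency lever: fewer tokens where nothing happens.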
In medical imaging and genomics, foundation models are offering unprecedented capabilities. SegAnyPET: Universal Promptable Segmentation from Positron Emission Tomography Images by Yichi Zhang et al. (Fudan University) introduces a novel foundation model for PET image segmentation and PETS-5k, the largest PET segmentation dataset. Similarly, Med-DualLoRA: Local Adaptation of Foundation Models for 3D Cardiac MRI by Perramon-Llussà et al. improves generalization in multi-center cardiac MRI by decoupling global and local adaptations using dual low-rank modules. In computational pathology, MINT: Molecularly Informed Training with Spatial Transcriptomics Supervision for Pathology Foundation Models by Lee, Chen et al. (Bioptimus, UCSF, Stanford) integrates spatial transcriptomics supervision, improving performance on both molecular and morphological tasks. FetalAgents: A Multi-Agent System for Fetal Ultrasound Image and Video Analysis by Xiaohui Hu and Jiawei Huang (UCSF, Stanford) automates fetal ultrasound analysis through a multi-agent system, supporting end-to-end video summarization and clinical reporting. To make these models accessible, MobileFetalCLIP: Selective Repulsive Knowledge Distillation for Mobile Fetal Ultrasound Analysis by Noman Saeed et al. (MBZUAI, Cambridge) compresses large vision-language models for mobile fetal ultrasound analysis without sacrificing zero-shot performance. For resource-efficient radiology, GreenRFM: Toward a resource-efficient radiology foundation model by Yingtai Li et al. (University of Science and Technology of China) prioritizes principled supervision over brute-force scaling, achieving state-of-the-art performance with significantly reduced computational requirements. MIL-PF: Multiple Instance Learning on Precomputed Features for Mammography Classification by Nikola Jovišić et al. (University of Belgrade) leverages precomputed features from frozen foundation models for efficient mammography classification. 
RPG-SAM: Reliability-Weighted Prototypes and Geometric Adaptive Threshold Selection for Training-Free One-Shot Polyp Segmentation by W. Lin and Y. Bai introduces a training-free framework for one-shot polyp segmentation addressing regional heterogeneity. In a crucial area of privacy, How Private Are DNA Embeddings? Inverting Foundation Model Representations of Genomic Sequences by Not-A-Feature highlights critical privacy risks associated with DNA embeddings from foundation models. Enhancing clinical predictions, EveryQuery: Zero-Shot Clinical Prediction via Task-Conditioned Pretraining over Electronic Health Records by Payal Chandak et al. (Harvard-MIT, Columbia) enables zero-shot clinical prediction from EHRs with task-conditioned pretraining. For fine-tuning medical models, Self-Auditing Parameter-Efficient Fine-Tuning for Few-Shot 3D Medical Image Segmentation by Son Thai Ly and Hien V. Nguyen introduces SEA-PEFT, a self-auditing framework for optimal PEFT configuration search. Finally, a comprehensive overview in Computational Pathology in the Era of Emerging Foundation and Agentic AI – International Expert Perspectives on Clinical Integration and Translational Readiness by Qian Da et al. reviews the clinical integration and translational readiness of AI in computational pathology, highlighting challenges and opportunities.
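Multiple-instance learning on frozen features, as in MIL-PF, reduces to pooling patch embeddings into a single exam-level vector; a common choice is attention-based pooling. This is a generic numpy sketch with made-up dimensions, not the paper's exact head.

```python
import numpy as np

def attention_mil_pool(H, V, w):
    """Attention-based MIL: score each instance, softmax, pool to a bag vector."""
    scores = np.tanh(H @ V) @ w            # one score per instance
    a = np.exp(scores - scores.max())
    a /= a.sum()                           # attention weights over instances
    return a @ H, a                        # bag embedding, weights

rng = np.random.default_rng(0)
H = rng.normal(size=(50, 384))   # 50 patch features from a frozen backbone
V = rng.normal(size=(384, 64))
w = rng.normal(size=64)
bag, weights = attention_mil_pool(H, V, w)
print(bag.shape, weights.sum())
```

Because the backbone stays frozen, only V, w, and a small classifier on the bag vector need training, which is what makes precomputed-feature MIL so cheap.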
Other areas are also seeing innovative applications. In remote sensing, FedEU: Evidential Uncertainty-Driven Federated Fine-Tuning of Vision Foundation Models for Remote Sensing Image Segmentation by Zhang Xuekai et al. (Tsinghua University) improves segmentation robustness through evidential uncertainty reduction in federated settings. SIGMAE: A Spectral-Index-Guided Foundation Model for Multispectral Remote Sensing by Xiaokang Zhang et al. (Wuhan University) leverages spectral indices to guide pretraining, outperforming existing methods in spatial and spectral reconstruction. LEPA: Learning Geometric Equivariance in Satellite Remote Sensing Data with a Predictive Architecture by Lars Bellier et al. (Swiss State Secretariat for Education, Research and Innovation) leverages geometric equivariance for efficient satellite remote sensing, while Spectral Gaps and Spatial Priors: Studying Hyperspectral Downstream Adaptation Using TerraMind by Julia A. Leonardi et al. (Politecnico di Milano, IBM Research Europe) explores the adaptability of multimodal geospatial foundation models to hyperspectral imaging tasks. Demystifying KAN for Vision Tasks: The RepKAN Approach by Minjong Cheon from Sejong University introduces an interpretable hybrid architecture combining CNNs with KANs for remote sensing image classification. In game AI, Resource-constrained Amazons chess decision framework integrating large language models and graph attention by Tianhao Qian et al. (Southeast University) combines graph-based learning with LLMs to create high-performance game AI under resource constraints. For electricity price forecasting, Regression Models Meet Foundation Models: A Hybrid-AI Approach to Practical Electricity Price Forecasting by Yunzhong Qiu et al. (Tsinghua University) introduces FutureBoosting, a hybrid AI approach that combines TSFMs with regression techniques for improved accuracy.
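Spectral indices of the kind SIGMAE uses as pretraining guidance are simple per-pixel band ratios. NDVI, for example, is the standard normalized difference of the near-infrared and red bands, shown here on synthetic reflectance data.

```python
import numpy as np

def ndvi(nir, red, eps=1e-8):
    """Normalized Difference Vegetation Index, bounded in [-1, 1]."""
    return (nir - red) / (nir + red + eps)

rng = np.random.default_rng(0)
red = rng.uniform(0.05, 0.3, size=(4, 4))    # synthetic reflectance bands
nir = rng.uniform(0.3, 0.7, size=(4, 4))
index = ndvi(nir, red)
print(index.min(), index.max())
```

Indices like this encode domain knowledge (vegetation reflects strongly in NIR) at zero learning cost, which is why they make useful targets or guides for multispectral pretraining.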
Under the Hood: Models, Datasets, & Benchmarks
These advancements are underpinned by novel architectures, extensive datasets, and rigorous benchmarks. Here’s a glimpse into the key resources driving progress:
- OmniStream: A unified streaming visual backbone using causal spatiotemporal attention and 3D-RoPE. Code: https://github.com/Go2Heart/OmniStream
- DVD: Leverages pre-trained video diffusion models for deterministic video depth estimation. Code: https://github.com/EnVision-Research/DVD
- Frequentist Consistency of Prior-Data Fitted Networks for Causal Inference: Implemented with a martingale-based OSPC framework. Code: https://anonymous.4open.science/r/frequentist-pfns/
- Exhaustive Circuit Mapping of a Single-Cell Foundation Model: Analyzes the Geneformer V2-316M model (available on HuggingFace: https://huggingface.co/ctheodoris/Geneformer). Code for SAE training: https://github.com/Biodyn-AI/sae-biological-map
- ELISA: Integrates scGPT expression embeddings with semantic retrieval and LLM interpretation for single-cell genomics. Code: https://github.com/omaruno/ELISA-An-AI-Agent-for-Expression-Grounded-Discovery-in-Single-Cell-Genomics.git
- Locating Demographic Bias at the Attention-Head Level in CLIP’s Vision Encoder: Utilizes CLIP ViT-L-14 encoder. Code: https://github.com/huggingface/transformers (for CLIP), https://github.com/google-research/Conceptual-Embeddings (for CAV-based methods)
- Shape-of-You: Reformulates semantic correspondence as a Fused Gromov-Wasserstein optimal transport problem. Code: https://github.com/hanyang-univ/Shape-of-You
- TimeSqueeze: A dynamic patching mechanism compatible with various Transformer backbones for time series forecasting. No public code provided yet.
- Hierarchical Granularity Alignment and State Space Modeling: Leverages DINOv2 and WavLM foundation models. Code: https://github.com/harryjun/HGA-SSM
- Interventional Time Series Priors for Causal Foundation Models: Introduces CausalTimePrior for synthetic TSCM generation. Code: https://github.com/thummd/CausalTimePrior
- SELF-VLA: A vision-language-action framework for contact-rich disassembly. Code: https://github.com/self-vla/self-vla
- SegAnyPET: A modality-specific 3D foundation model for PET image segmentation; introduces PETS-5k dataset (5,731 3D whole-body PET images). Paper: https://arxiv.org/pdf/2502.14351
- GTM: A general time-series model with a novel Fourier attention mechanism. Code: https://github.com/MMTS4All/GTM
- Med-DualLoRA: Federated fine-tuning for 3D Cardiac MRI using dual low-rank modules. Code: https://github.com/username/Med-DualLoRA
- Pointy: A lightweight transformer for point cloud processing, achieving strong performance with limited data. Code: https://github.com/KonradSzafer/Pointy
- BALD-SAM: An active prompting framework for interactive segmentation leveraging Bayesian uncertainty modeling within SAM. No public code provided yet.
- RandMark: A watermarking methodology for VFMs. No public code provided yet.
- Prompting with the human-touch: Provides an open-source codebase for prompt extraction and model inference. Code: https://github.com/CarolineMagg/segmentation-FM-benchmark/
- Resource-constrained Amazons chess decision framework: Integrates LLMs and graph attention. Code: https://github.com/Resource-constrained-Amazons-Chess
- OilSAM2: A memory-augmented segmentation framework tailored for SAR oil spill detection, leveraging Segment Anything Model 2 (SAM2). Code: https://github.com/Chenshuaiyu1120/OILSAM2
- An Automated Radiomics Framework for Postoperative Survival Prediction: Introduces SAMONAI (extending SAM to 3D) and SurvAMINN (autoencoder-based MIL network). No public code provided yet.
- Dissecting Chronos: Applies sparse autoencoders to Chronos-T5-Large. No public code provided yet.
- OmniGuide: A unified framework for incorporating multiple types of guidance into Vision-Language-Action (VLA) models. Code: https://omniguide.github.io/
- Evaluating Progress in Graph Foundation Models: Introduces a comprehensive benchmark for GFMs. Code: https://github.com/smufang/GFMBenchmark
- TAMUSA-Chat: An open research framework for developing LLM-based conversational systems. Code: https://github.com/alsmadi/TAMUSA_LLM_Based_Chat_app
- SOTA: A training-free framework for zero-shot classification with multiple foundation models. Code: https://github.com/Afleve/self-adaptive-Optimal-Transport
- SignalMC-MED: A large-scale multimodal benchmark dataset (22,256 visits) for biosignal FMs using synchronized ECG and PPG. Code: https://github.com/fregu856/SignalMC-MED
- World2Mind: A training-free toolkit for allocentric spatial reasoning. No public code provided yet.
- X-GS: An extensible open framework unifying 3DGS architectures with downstream multimodal models. No public code provided yet.
- Variational Routing: A scalable Bayesian framework for calibrated Mixture-of-Experts Transformers. No public code provided yet.
- EventVGGT: Leverages VGGT (a multi-view foundation model) for annotation-free depth estimation. No public code provided yet.
- MIL-PF: Uses precomputed features from frozen DINOv2 and MedSigLIP for mammography classification. Code: https://github.com/njovisic/MIL-PF
- When Detectors Forget Forensics: Introduces Geometric Semantic Decoupling (GSD) for AI-generated image detection. No public code provided yet.
- UniField: A unified framework for enhancing MRI images; introduces a large-scale paired multi-field MRI dataset. No public code provided yet.
- Zero-Shot and Supervised Bird Image Segmentation: Uses Grounding DINO 1.5, YOLOv11, and SAM 2.1. Code: https://github.com/mvsakrishna/bird-segmentation-2025
- Retrieval-Augmented Generation with Covariate Time Series: Introduces RAG4CTS framework for TSFMs in industrial applications. Code: https://github.com/apache/iotdb/tree/research/rag4cts
- Impermanent: A live benchmark for temporal generalization in time series forecasting, based on GitHub activity streams. Code: https://github.com/TimeCopilot/impermanent
- FOMO-3D: A multi-modal 3D object detection framework using OWLv2 and Metric3D with LiDAR data. Code: no dedicated FOMO-3D repository is listed; the paper refers to several arXiv IDs for related models.
- Efficient Credal Prediction through Decalibration: Evaluates on large models like TabPFN and CLIP. Code: https://github.com/pwhofman/efficient-credal-prediction
- Learning Multiple Utterance-Level Attribute Representations: Uses a shared speech encoder for semantic and speaker attributes. Code: https://github.com/speechbrain/speechbrain/tree/develop/recipes/CommonVoice/SENSE
- Distributional Regression with Tabular Foundation Models: Evaluates realTabPFNv2.5 and TabICLv2. Code: https://github.com/PriorLabs/TabPFN/pull/689
- Covenant-72B: A 72B-parameter LLM trained via decentralized, trustless peer collaboration. Code: https://huggingface.co/PsycheFoundation/consilience-40b-7Y9v38s5
- UniGround: A training-free framework for open-world zero-shot 3D visual grounding. No public code provided yet.
- Tiny Autoregressive Recursive Models: Explores compute allocation in autoregressive Transformers. Code: https://github.com/pauliusrauba/autoregressive-TRM
- EveryQuery: An EHR foundation model for zero-shot clinical prediction. No public code provided yet.
- MINT: Integrates spatial transcriptomics supervision into pathology ViTs (e.g., UNI2-h on Hugging Face: https://huggingface.co/MahmoodLab/UNI2-h). Code: https://github.com/bioptimus/releases/tree/main/models/h-optimus/v0
- LEPA: A predictive architecture for learning geometric equivariance in satellite remote sensing. Code: https://github.com/embed2scale/LEPA
- FedEU: A federated learning approach for remote sensing image segmentation using evidential uncertainty reduction. Code: https://github.com/zxk688/FedEU
- SIGMAE: A spectral-index-guided foundation model for multispectral remote sensing. Code: https://github.com/zxk688/SIGMAE
- Continual Adaptation for Pacific Indigenous Speech Recognition: Investigates cross-lingual transfer in underrepresented languages. No public code provided yet.
- GazeMoE: An MoE-based framework for gaze target perception. Code: https://github.com/GazeMoE
- FreeOcc: A training-free framework for panoptic occupancy prediction, leveraging Segment Anything (SAM3) and MapAnything. Code: https://github.com/FreeOcc/FreeOcc
- GreenRFM: A resource-efficient radiology foundation model. Code: https://github.com/GreenRFM
- CaTok: A 1D causal image tokenizer with a MeanFlow decoder. No public code provided yet.
- RePer-360: Uses perspective priors and self-modulation for 360° depth estimation. No public code provided yet.
- OVGGT: A training-free online streaming framework for 3D geometry inference. No public code provided yet.
- MemSeg-Agent: A memory-augmented agent for medical image segmentation. No public code provided yet.
- Self-Auditing Parameter-Efficient Fine-Tuning: Introduces SEA-PEFT for few-shot 3D medical image segmentation. Code: https://github.com/tsly123/SEA_PEFT
- Open-World Task and Motion Planning: Introduces OWL-TAMP, combining VLMs and TAMP. Code: https://github.com/nvidia-research/owl-tamp
- Exploring the potential and limitations of Model Merging: Introduces MergeWhisper toolkit for multi-domain ASR adaptation. Code: https://github.com/INESC-ID/mergekit
- Dark3R: A framework for Structure from Motion (SfM) in low-light conditions. Code: andrewguo.com/pub/dark3r
- SarcasmMiner: A dual-track post-training framework for robust audio-visual sarcasm reasoning. Code: https://github.com/qwenlm/SarcasmMiner
- AIM-SLAM: Dense Monocular SLAM via Adaptive and Informative Multi-View Keyframe Prioritization. Code: https://aimslam.github.io/
- Efficient Domain-Adaptive Multi-Task Dense Prediction: Uses vision foundation models for efficient domain adaptation. Code: https://github.com/fudan-zvg/Semantic-Segment-Anything
Impact & The Road Ahead
The collective impact of this research is profound, pushing foundation models beyond mere academic curiosities into powerful, practical tools. We’re seeing a clear trend towards making these models more efficient, interpretable, and adaptable to real-world complexities. The emphasis on techniques like knowledge distillation (e.g., MobileFetalCLIP, EventVGGT), parameter-efficient fine-tuning (e.g., Med-DualLoRA, SEA-PEFT), and novel attention mechanisms (e.g., TimeSqueeze, GTM) speaks to the urgent need for deploying powerful AI responsibly and sustainably.
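Knowledge distillation of the kind MobileFetalCLIP and EventVGGT rely on typically softens teacher and student logits with a temperature and minimizes their KL divergence. The sketch below is the standard formulation on toy logits, not either paper's specific recipe.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, T)           # soft teacher targets
    q = softmax(student_logits, T)
    # T^2 rescaling keeps gradient magnitudes comparable across temperatures
    return T * T * np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 10))
student = teacher + 0.1 * rng.normal(size=(8, 10))  # near-copy of the teacher
print(distillation_loss(student, teacher))
```

The temperature exposes the teacher's "dark knowledge" (relative probabilities of wrong classes), which is what lets a small student inherit behavior the hard labels alone would not convey.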
From medicine (FetalAgents, SegAnyPET, GreenRFM, MINT) to robotics (SELF-VLA, TiPToP, OmniGuide, Safe-Night VLA) and environmental monitoring (OilSAM2, FedEU, SIGMAE), foundation models are democratizing access to advanced AI capabilities. The development of specialized benchmarks (Daily-Omni, Impermanent, SignalMC-MED) and frameworks for evaluating bias (Locating Demographic Bias at the Attention-Head Level in CLIP’s Vision Encoder) and ethical deployment (TAMUSA-Chat) is critical for fostering trust and ensuring equitable access to these technologies.
The road ahead promises even more exciting advancements. We can anticipate further integration of physics-informed AI for robust predictions (RAG4CTS, On the Value of Tokeniser Pretraining in Physics Foundation Models), more sophisticated multimodal fusion strategies, and agentic AI systems that can reason and interact with the world in increasingly human-like ways. The focus on mitigating biases, enhancing privacy (How Private Are DNA Embeddings?), and ensuring robust performance under diverse conditions will be paramount. As foundation models continue to evolve, they will undoubtedly unlock new possibilities across science, industry, and daily life, but their true potential will only be realized through continued collaboration, innovation, and a strong commitment to responsible AI development.