Unlocking AI’s Next Frontier: A Roundup of Breakthroughs in Foundation Models
Latest 50 papers on foundation models: Nov. 23, 2025
Foundation models are the bedrock of modern AI, offering incredible potential across diverse domains, from vision to language, and even scientific simulation. Yet, harnessing their full power often involves navigating complex challenges like domain adaptation, efficiency, and robustness. This past research period has brought forth a fascinating array of advancements, pushing the boundaries of what these powerful models can achieve. Let’s dive into some of the latest breakthroughs and their implications.
The Big Idea(s) & Core Innovations
A central theme emerging from recent research is the synergistic integration of multimodal data and adaptive learning strategies to overcome limitations and enhance model capabilities. We’re seeing a push to make foundation models more adaptable, efficient, and robust across various, often challenging, real-world scenarios.
For instance, in the realm of 3D scene understanding, researchers are actively bridging the gap between 2D and 3D learning. Imperial College London’s team, in their paper “POMA-3D: The Point Map Way to 3D Scene Understanding”, introduces POMA-3D, which leverages point maps to preserve global geometry while being compatible with 2D foundation models. Similarly, the work from RWTH Aachen University and Bosch Center for AI, “DINO in the Room: Leveraging 2D Foundation Models for 3D Segmentation”, demonstrates how features from 2D vision foundation models like DINOv2 can be injected or distilled into 3D models for state-of-the-art 3D segmentation, even without 2D data at inference. Further solidifying this 3D vision trend, “ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation” by Bosch Research pioneers a vision-only approach, generating 3D pseudo-labels from video using geometric and semantic foundation models, eliminating the need for LiDAR or manual 3D annotations. This data-centric approach achieves a substantial 34% relative improvement on the Occ3D-nuScenes benchmark.
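To make the 2D-to-3D transfer concrete, here is a minimal sketch of the general distillation recipe these papers build on: project each 3D point into the image, sample a frozen 2D foundation model's features at that pixel, and train the 3D backbone to match them. The camera projection, the toy linear backbone, and all tensor shapes below are illustrative stand-ins, not the DITR implementation.

```python
# Minimal sketch of 2D-to-3D feature distillation (illustrative, not DITR).
import torch
import torch.nn.functional as F

def sample_2d_features(feat_2d, pixel_uv):
    """Bilinearly sample a frozen 2D feature map (e.g. from DINOv2) at the
    pixels where each 3D point projects.
    feat_2d:  (1, C, H, W) frozen 2D features
    pixel_uv: (N, 2) projected coordinates, normalized to [-1, 1]
    returns:  (N, C) per-point 2D features
    """
    grid = pixel_uv.view(1, 1, -1, 2)                      # (1, 1, N, 2)
    sampled = F.grid_sample(feat_2d, grid, align_corners=False)
    return sampled.squeeze(0).squeeze(1).t()               # (N, C)

# Toy setup: 4,096 points, 384-dim DINOv2-like features on a 32x32 map.
points_3d = torch.randn(4096, 3)
feat_2d = torch.randn(1, 384, 32, 32)        # frozen 2D features
pixel_uv = torch.rand(4096, 2) * 2 - 1       # stand-in for a camera projection
backbone_3d = torch.nn.Linear(3, 384)        # stand-in for a real 3D backbone

feat_3d = backbone_3d(points_3d)                         # (N, 384) 3D features
target = sample_2d_features(feat_2d, pixel_uv).detach()  # distillation targets
# Cosine distillation: pull each point's 3D feature toward its 2D feature.
loss = 1.0 - F.cosine_similarity(feat_3d, target, dim=-1).mean()
loss.backward()                              # updates only the 3D backbone
print(f"distillation loss: {loss.item():.4f}")
```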
Addressing the critical need for robustness and generalization, especially in challenging conditions, multiple papers offer innovative solutions. “Enhancing Generalization of Depth Estimation Foundation Model via Weakly-Supervised Adaptation with Regularization” from South China University of Technology introduces WeSTAR, a parameter-efficient framework that improves depth estimation in unseen and corrupted domains through weak supervision and regularization. In a similar vein, “Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models” by researchers from the University of Technology Sydney and York University presents Uni-Adapter, a training-free online test-time adaptation strategy for 3D vision-language foundation models (VLFMs). This method dynamically updates prototypes to handle domain shifts, achieving state-of-the-art results on corrupted datasets.
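For intuition on how training-free adaptation can work at all, here is a minimal sketch of a prototype cache: keep one running embedding per class, update it from confident test-time predictions, and blend cache similarities with the frozen model's zero-shot logits. The confidence threshold, momentum update, and fusion rule below are generic placeholder choices, not Uni-Adapter's actual cluster-based algorithm.

```python
# Minimal prototype-cache test-time adaptation (generic, not Uni-Adapter).
import torch
import torch.nn.functional as F

class PrototypeCache:
    def __init__(self, num_classes, dim, momentum=0.9, conf_thresh=0.7):
        self.protos = torch.zeros(num_classes, dim)
        self.initialized = torch.zeros(num_classes, dtype=torch.bool)
        self.momentum = momentum
        self.conf_thresh = conf_thresh

    def update(self, feat, probs):
        """Update prototypes online from a confident prediction (no training)."""
        conf, label = probs.max(dim=-1)
        if conf.item() < self.conf_thresh:
            return                               # skip unreliable samples
        f = F.normalize(feat, dim=-1)
        if self.initialized[label]:
            self.protos[label] = self.momentum * self.protos[label] \
                               + (1 - self.momentum) * f
        else:
            self.protos[label] = f
            self.initialized[label] = True

    def predict(self, feat, zero_shot_logits, alpha=0.5):
        """Blend the frozen model's zero-shot logits with cache similarities."""
        f = F.normalize(feat, dim=-1)
        cache_logits = f @ F.normalize(self.protos, dim=-1).t()
        return alpha * zero_shot_logits + (1 - alpha) * cache_logits

# Toy usage: 512-dim embeddings from a frozen 3D VLFM, 10 classes.
cache = PrototypeCache(num_classes=10, dim=512)
feat = torch.randn(512)            # embedding of one test point cloud
zero_shot = torch.randn(10)        # logits against class text prompts
cache.update(feat, zero_shot.softmax(-1))
fused = cache.predict(feat, zero_shot)
print("adapted prediction:", fused.argmax().item())
```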
For generative AI, “Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation” from Kandinsky Lab introduces a suite of models with a novel NABLA mechanism that significantly reduces computational complexity for high-resolution and long-duration video generation, making advanced generative capabilities more efficient. Peking University’s “Diffusion As Self-Distillation: End-to-End Latent Diffusion In One Model” presents DSD, unifying the encoder, decoder, and diffusion model into a single network, addressing latent-collapse issues and achieving competitive results with fewer parameters.
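NABLA's specifics live in the paper, but the family it belongs to, block-sparse attention, is easy to illustrate: instead of scoring every query against every key, only a selected subset of (query block, key block) pairs is evaluated, so cost grows with the number of kept blocks rather than the full N². The toy below keeps a fixed diagonal neighborhood and, for readability, masks a dense score matrix; a real kernel skips the dropped blocks entirely, and NABLA selects blocks adaptively rather than with a fixed window.

```python
# Toy block-sparse attention (the general family NABLA belongs to).
# NOTE: assumes n is divisible by `block`. This dense toy masks a full
# score matrix for clarity; a real kernel computes only the kept blocks.
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block=64, window=1):
    """q, k, v: (n, d). Each query block attends only to key blocks within
    +/- `window` of itself; all other scores are masked before the softmax."""
    n, d = q.shape
    nb = n // block
    qb = torch.arange(nb).repeat_interleave(block)      # block id per query
    kb = torch.arange(nb).repeat_interleave(block)      # block id per key
    keep = (qb[:, None] - kb[None, :]).abs() <= window  # (n, n) block mask
    scores = (q @ k.t()) / d ** 0.5
    scores = scores.masked_fill(~keep, float("-inf"))
    return F.softmax(scores, dim=-1) @ v, keep

n, d = 256, 32
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
out, mask = block_sparse_attention(q, k, v)
print(out.shape, f"{mask.float().mean():.0%} of score entries evaluated")
```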
In specialized domains, foundation models are showing immense promise. “Walrus: A Cross-Domain Foundation Model for Continuum Dynamics” from the Flatiron Institute and NYU introduces a model designed to simulate continuum dynamics across diverse physical scenarios, leveraging adaptive-compute tokenization for efficiency. In healthcare, “X-WIN: Building Chest Radiograph World Model via Predictive Sensing” by Rensselaer Polytechnic Institute and Massachusetts General Hospital integrates 3D spatial knowledge into a CXR world model to improve disease diagnosis. Additionally, Stanford University’s “nnMIL: A generalizable multiple instance learning framework for computational pathology” provides a scalable MIL framework for computational pathology, enhancing slide-level predictions with principled uncertainty estimation.
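To ground the pathology discussion: multiple instance learning (MIL) treats a whole slide as a “bag” of patch embeddings and learns to pool them into a single slide-level prediction. The gated-attention head below is the classic ABMIL-style formulation that such frameworks typically build on; it is a generic sketch, not nnMIL's actual architecture.

```python
# Generic gated-attention MIL head for slide-level prediction (ABMIL-style).
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, in_dim=768, hid_dim=256, num_classes=2):
        super().__init__()
        self.att_v = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Tanh())
        self.att_u = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Sigmoid())
        self.att_w = nn.Linear(hid_dim, 1)
        self.classifier = nn.Linear(in_dim, num_classes)

    def forward(self, patches):
        """patches: (num_patches, in_dim) embeddings from a frozen patch
        encoder (e.g. a pathology foundation model)."""
        a = self.att_w(self.att_v(patches) * self.att_u(patches))  # (N, 1)
        a = torch.softmax(a, dim=0)             # attention over patches
        slide_feat = (a * patches).sum(dim=0)   # weighted slide embedding
        return self.classifier(slide_feat), a.squeeze(-1)

# Toy usage: one slide represented by 1,000 patch embeddings of dim 768.
model = AttentionMIL()
logits, attn = model(torch.randn(1000, 768))
print(logits.shape, attn.shape)   # (2,) slide logits, (1000,) patch weights
```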
However, it’s not always a clear win for foundation models. In “Are Foundation Models Useful for Bankruptcy Prediction?”, researchers from Wrocław University of Science and Technology find that classical machine learning methods often outperform LLMs like Llama-3.3 and TabPFN in structured financial prediction tasks, pointing to issues with calibration and computational cost. This highlights the importance of domain-specific evaluation and understanding where these powerful models truly add value.
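As a concrete illustration of what “calibration” means here, this hypothetical scikit-learn harness fits a classical baseline on synthetic, imbalanced tabular data and reports both discrimination (AUC) and calibration (Brier score); a foundation-model baseline would plug in as another predict_proba-style function. The data and numbers are stand-ins and do not reflect the paper's results.

```python
# Hypothetical tabular-prediction harness: discrimination vs. calibration.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for real bankruptcy records.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
p = clf.predict_proba(X_te)[:, 1]
print(f"AUC:   {roc_auc_score(y_te, p):.3f}")
print(f"Brier: {brier_score_loss(y_te, p):.3f}  # lower = better calibrated")
```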
Under the Hood: Models, Datasets, & Benchmarks
These advancements are often enabled by novel architectures, specially curated datasets, and rigorous benchmarking. Here’s a quick look at the core resources driving these breakthroughs:
- POMA-3D: Introduces ScenePoint, a large-scale point map dataset from 6.5K RGB-D scenes and 1M 2D image scenes, and POMA-JEPA, a joint embedding-predictive architecture. Project page: https://matchlab-imperial.github.io/poma3d
- LAOF: Leverages optical flow constraints as pseudo-supervision for robust latent action learning. Code available: https://github.com/XizoB/LAOF
- iLTM: An Integrated Large Tabular Model combining GBDTs, hypernetworks, and MLPs, meta-trained on thousands of real-world tabular datasets. Code: https://github.com/AI-sandbox/iLTM
- DITR: Integrates 2D foundation models like DINOv2 for 3D segmentation, with a distillation approach (D-DITR) for pretraining 3D models using unlabeled images. Project page: https://vision.rwth-aachen.de/ditr
- Upsample Anything: A universal, training-free method for feature upsampling by learning anisotropic Gaussian kernels at test-time (a simplified sketch appears after this list). Paper: https://arxiv.org/pdf/2511.16301
- Walrus: A cross-domain foundation model for continuum dynamics, incorporating patch jittering, 2D-to-3D data augmentation, and adaptive-compute tokenization. Code: https://github.com/PolymathicAI/walrus
- GEO-Bench-2: A comprehensive evaluation framework with 19 curated datasets and ‘capability’ groups for Geospatial Foundation Models (GeoFMs). Paper: https://arxiv.org/pdf/2511.15658
- TSFM in-context learning: Utilizes pre-trained General Time Transformer (GTT) models with few-shot prompting for time-series classification. Paper: https://arxiv.org/pdf/2511.15447
- Uni-Adapter: A training-free online TTA strategy for 3D VLFMs, using cluster-based caching and graph-based label smoothing. Project page: https://mehran-tam.github.io/Uni-Adapter
- Unbiased Semantic Decoding: Uses vision foundation models with a novel decoding framework for few-shot segmentation. Code: https://github.com/vangjin/USD
- Kandinsky 5.0: A family of image/video generation models, featuring the NABLA mechanism and a multi-stage training pipeline. Code: https://github.com/kandinskylab/kandinsky-5
- nnMIL: A generalizable multiple instance learning framework for computational pathology, incorporating random sampling at patch and feature levels. Code: https://github.com/Luoxd1996/nnMIL
- MergeDNA: A hierarchical framework for genome modeling, with a learnable DNA tokenizer and dynamic token merging. Paper: https://arxiv.org/pdf/2511.14806
- RoboCrafter-QA: A benchmark and fine-tuned LLM approach for soft robot design. Code: https://github.com/robocrafterqa/robocrafterqa
- DSD: A unified end-to-end trainable network for diffusion modeling, addressing latent collapse. Paper: https://arxiv.org/pdf/2511.14716
- SweeperBot: A system for accessible 3D browsing, combining optimal view selection with generative and recognition-based foundation models. Paper: https://arxiv.org/pdf/2511.14567
- SEED-SR: Uses segmentation-aware latent diffusion for 20× super-resolution of satellite images, leveraging multiple geo-spatial foundation models. Paper: https://arxiv.org/pdf/2511.14481
- MAVias: An open-set bias mitigation approach for computer vision using vision-language embeddings from foundation models. Paper: https://arxiv.org/pdf/2412.06632
- LED: Light Enhanced Depth estimation for nighttime autonomous driving, compatible with architectures like Adabins, DepthFormer, and Depth Anything V2, and a new Nighttime Synthetic Drive Dataset. Project page: https://simondemoreau.github.io/LED/
- SenseNova-SI: A family of multimodal foundation models for spatial intelligence, trained on SenseNova-SI-8M, eight million spatially grounded data samples. Code: https://github.com/OpenSenseNova/SenseNova-SI
- UnSAMv2: A self-supervised framework for granularity-controllable segmentation, learning hierarchical structures from only 6,000 unlabeled images. Project page: https://yujunwei04.github.io/UnSAMv2-Project-Page/
- Lang1: A domain-specialized LLM trained on 80 billion clinical tokens and web text, evaluated on the ReMedE benchmark for hospital operations. Paper: https://arxiv.org/pdf/2511.13703
- OlmoEarth: A spatio-temporal, multimodal foundation model for Earth observation, introducing Latent MIM Lite, a modality-aware masking strategy, and a novel contrastive loss. Code and platform: https://github.com/allenai/olmoearth_pretrain and olmoearth.allenai.org
- NuClass: A multi-scale integration framework for histopathology images with Path local and Path global components, and a marker-guided dataset from spatial transcriptomics. Paper: https://arxiv.org/pdf/2511.13586
- ViXML: A multi-modal framework for Extreme Multi-label Classification (XMC), integrating visual metadata from Amazon Reviews with decoder-only models. Code: https://github.com/DiegoOrtego/vixml
- FGNet: Transfers knowledge from SAM2 to 3D EM neuron segmentation using a Feature-Guided Attention module. Paper: https://arxiv.org/pdf/2511.13063
- GeoUniPS: A photometric stereo method leveraging geometric priors from 3D reconstruction models and a Light-Geometry Dual-Branch Encoder. Code: https://github.com/marcotam2002/geounips
- DiffuDepGrasp: Uses diffusion models to model depth noise for Sim2Real robotic grasping. Project page: https://diffudepgrasp.github.io/
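One entry above deserves a closer look: the Upsample Anything bullet describes fitting anisotropic Gaussian kernels at test time to upsample any feature map. As a heavily simplified, isotropic stand-in for that idea, here is a joint-bilateral-style guided upsampler in NumPy, where each high-resolution pixel pools low-resolution features weighted by spatial distance and guidance-image similarity. The kernel form, the strided guidance downsampling, and all parameters are illustrative assumptions, not the paper's method.

```python
# Simplified guided feature upsampling (isotropic joint-bilateral sketch).
import numpy as np

def guided_upsample(feat_lr, guide_hr, sigma_xy=1.5, sigma_g=0.2):
    """feat_lr: (h, w, C) low-res features; guide_hr: (H, W) grayscale
    guidance image; returns (H, W, C) upsampled features."""
    h, w, C = feat_lr.shape
    H, W = guide_hr.shape
    # Guidance reduced to the feature grid by simple striding (toy choice).
    guide_lr = guide_hr[::H // h, ::W // w][:h, :w]
    ys, xs = np.mgrid[0:h, 0:w]
    out = np.zeros((H, W, C))
    for i in range(H):
        for j in range(W):
            ci, cj = i * h / H, j * w / W           # position on low-res grid
            d2 = (ys - ci) ** 2 + (xs - cj) ** 2
            wgt = np.exp(-d2 / (2 * sigma_xy ** 2))            # spatial kernel
            wgt = wgt * np.exp(-((guide_lr - guide_hr[i, j]) ** 2)
                               / (2 * sigma_g ** 2))           # range kernel
            wgt /= wgt.sum()
            out[i, j] = (wgt[..., None] * feat_lr).sum((0, 1))
    return out

feat_lr = np.random.rand(8, 8, 16)     # e.g. ViT patch features
guide_hr = np.random.rand(32, 32)      # high-res grayscale guidance
print(guided_upsample(feat_lr, guide_hr).shape)   # (32, 32, 16)
```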
Impact & The Road Ahead
The collective impact of this research is profound, painting a picture of AI systems that are more perceptive, adaptive, and capable of tackling real-world complexities. The advancements in 3D vision, from self-supervised learning with point maps to LiDAR-free occupancy estimation, are paving the way for more robust autonomous systems and richer immersive experiences. The strides in generative models, particularly in efficiency and quality, promise to democratize access to powerful creative AI tools.
Critically, the emphasis on domain-specific adaptation and bias mitigation points towards a future where foundation models are not just powerful, but also reliable, fair, and trustworthy. The medical imaging breakthroughs, especially in pathology and radiology, underscore the potential for AI to revolutionize diagnostics and patient care, provided models are rigorously specialized and evaluated. The push for multimodal integration, as seen in vision-language synergy for abstract reasoning or the integration of visual metadata into extreme multi-label classification (XMC), indicates a future where AI understands and interacts with the world in a more holistic, human-like manner.
However, challenges remain. The insights from bankruptcy prediction highlight that generalist models are not a panacea, and careful consideration of model limitations, computational cost, and interpretability is crucial for high-stakes applications. The sensitivity of multimodal models to prompt variations points to the need for more robust training strategies and data augmentation. Future work will likely focus on even more efficient domain adaptation, deeper multimodal reasoning, and building robust, trustworthy AI systems that can seamlessly operate across diverse, dynamic environments. The journey towards truly intelligent and adaptable foundation models is ongoing, and these recent breakthroughs mark significant steps forward.