Unleashing the Power of Foundation Models: From Physics to Robotics and Beyond
Latest 100 papers on foundation models: May 16, 2026
The landscape of AI/ML is undergoing a profound transformation, driven by the emergence and rapid evolution of foundation models. These colossal models, pre-trained on vast and diverse datasets, promise to generalize across a multitude of tasks and domains, fundamentally altering how we approach problem-solving in areas ranging from scientific discovery to embodied AI and clinical applications. This blog post dives into recent breakthroughs, showcasing how researchers are pushing the boundaries of these models, tackling critical challenges, and redefining the very architecture of intelligence.
The Big Idea(s) & Core Innovations
One of the most exciting trends is the quest for domain-agnostic generalization while simultaneously addressing domain-specific complexities. For instance, in scientific machine learning, the Eradicating Negative Transfer in Multi-Physics Foundation Models via Sparse Mixture-of-Experts Routing paper by Ellwil Sharma and Arastu Sharma from Shodh AI introduces Shodh-MoE, a sparse Mixture-of-Experts (MoE) architecture that autonomously prevents negative transfer in multi-physics simulations. Their key insight: models can discover stable computational partitions in latent space without predefined operator decomposition, allowing simultaneous convergence across conflicting PDE regimes. This is a game-changer for universal scientific foundation models.
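To make the routing idea concrete, here is a minimal sketch of sparse top-k Mixture-of-Experts gating: a gate scores every expert per input, keeps only the k highest-scoring experts, and renormalizes their weights so each input activates a small, specialized subset of the network. The function names and shapes below are my own illustration, not Shodh-MoE's actual routing or its physics-informed components.

```python
import numpy as np

def topk_moe_route(x, gate_w, k=2):
    """Sparse top-k gating: score all experts, keep only the k best.

    x: (d,) input embedding; gate_w: (num_experts, d) gating weights.
    Returns the selected expert indices and their renormalized weights.
    """
    logits = gate_w @ x                           # one score per expert
    top = np.argsort(logits)[-k:]                 # indices of the k highest scores
    w = np.exp(logits[top] - logits[top].max())   # numerically stable softmax
    w /= w.sum()                                  # weights over survivors sum to 1
    return top, w

def moe_forward(x, gate_w, experts, k=2):
    """Combine only the selected experts' outputs, weighted by the gate."""
    idx, w = topk_moe_route(x, gate_w, k)
    return sum(wi * experts[i](x) for i, wi in zip(idx, w))
```

Because only k experts run per input, conflicting regimes (here, conflicting PDE families) can in principle claim disjoint experts instead of interfering in shared weights.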
Similarly, in time series, SurF: A Generative Model for Multivariate Irregular Time Series Forecasting reinterprets the Time Rescaling Theorem as a learnable bidirectional flow, enabling a single model to achieve zero-shot transfer across heterogeneous event-stream datasets. The authors, from the University of Toronto and Vector Institute, found that a shared Exp(1) target provides domain invariance, a crucial step towards foundation models for asynchronous event streams.
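The classical Time Rescaling Theorem the paper reinterprets says that if events follow an intensity function λ(t), then the compensator increments between consecutive events are i.i.d. Exp(1). The numerical sketch below illustrates only that classical rescaling (the helper name and midpoint integration are my own), not SurF's learnable bidirectional flow:

```python
def rescale_intervals(event_times, intensity, grid=1000):
    """Map inter-event gaps to compensator increments.

    Under the true intensity, the returned increments are i.i.d. Exp(1)
    by the time rescaling theorem; a shared Exp(1) target is what makes
    the representation domain-invariant across event streams.
    """
    taus, prev = [], 0.0
    for t in event_times:
        # midpoint-rule integration of the intensity over (prev, t]
        dt = (t - prev) / grid
        tau = sum(intensity(prev + (j + 0.5) * dt) for j in range(grid)) * dt
        taus.append(tau)
        prev = t
    return taus
```

For a homogeneous Poisson process with rate 2 and events spaced 0.5 apart, every rescaled increment is exactly 1, the mean of Exp(1).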
Robotics is seeing a paradigm shift towards spatial awareness and efficient action grounding. Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model by Tao Lin et al. from Shanghai Jiao Tong University and KAUST introduces a lightweight VLA framework that implicitly extracts depth features from multi-view RGB images, demonstrating that implicit depth can replace heavy geometry models for robust manipulation. Complementing this, AttenA+: Rectifying Action Inequality in Robotic Foundation Models, from HKUST(GZ) and collaborators, identifies and corrects a fundamental flaw in robot training: treating all actions equally. Their velocity-driven action attention prioritizes precision-critical, low-velocity movements, leading to significant performance gains.
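The core intuition behind velocity-driven weighting can be sketched as a reweighted imitation loss: steps where the demonstrated action moves slowly (fine, precision-critical motions) receive larger loss weight than fast transit motions. This is a generic sketch under my own assumptions (inverse-velocity weights, L2 per-step loss), not the AttenA+ attention mechanism itself:

```python
import numpy as np

def velocity_weighted_loss(pred, target, eps=1e-3):
    """Per-step L2 action loss, reweighted toward low-velocity steps.

    pred, target: (T, D) action trajectories. Velocity is estimated as
    the per-step displacement of the target actions; slow steps get
    large weights, fast steps small ones.
    """
    vel = np.linalg.norm(np.diff(target, axis=0, prepend=target[:1]), axis=1)
    w = 1.0 / (vel + eps)                 # slow movement -> large weight
    w = w / w.sum()                       # normalize to a distribution over steps
    per_step = np.sum((pred - target) ** 2, axis=1)
    return float(np.sum(w * per_step))
```

Compared with a uniform loss, errors made during slow, contact-rich phases dominate the objective, which is the inequality the paper argues standard training ignores.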
Security and robustness are paramount. One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries by Itay Zloczower et al. from Ben-Gurion University of the Negev reveals a critical vulnerability in current malicious fine-tuning (MFT) defenses. They propose adaptive attacks that optimize for both harmfulness and capability preservation, showing that existing defenses, which assume attackers optimize only for harm, are inherently flawed. In LLM agent security, Exploiting LLM Agent Supply Chains via Payload-less Skills by Xinyu Liu et al. from Zhejiang University introduces Semantic Compliance Hijacking (SCH), a novel payload-less attack where malicious intent is disguised as natural language compliance rules. This attack achieves high success rates with zero detection, revealing an “Alignment-Security Paradox” where highly aligned models are paradoxically more vulnerable to semantic deception.
In the realm of multimodal learning, SMA: Submodular Modality Aligner For Data Efficient Multimodal Learning from The University of Texas at Dallas proposes a set-based approach using Submodular Mutual Information (SMI) to align multiple views of an entity, reducing the modality gap with significantly less data. Their work shows that standard pairwise objectives like CLIP emerge as special cases, highlighting the power of combinatorial optimization for multimodal alignment.
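Set-based alignment can be illustrated with the standard submodular mutual information construction: for a monotone submodular function f, I_f(A; Q) = f(A) + f(Q) - f(A ∪ Q), and items are picked greedily to maximize their SMI with a query set. The sketch below uses the facility-location function as f; it is a generic illustration of the SMI idea (function names and the tiny greedy loop are my own), not SMA's objective or training procedure:

```python
import numpy as np

def facility_location(S, A):
    """f(A): each ground-set item contributes its best similarity to A."""
    if not A:
        return 0.0
    return float(S[:, list(A)].max(axis=1).sum())

def smi(S, A, Q):
    """Submodular mutual information I_f(A;Q) = f(A) + f(Q) - f(A ∪ Q)."""
    return facility_location(S, A) + facility_location(S, Q) \
        - facility_location(S, A | Q)

def greedy_align(S, Q, budget):
    """Greedily pick items whose SMI with the query set Q is largest."""
    A, rest = set(), set(range(S.shape[1])) - Q
    for _ in range(budget):
        best = max(rest, key=lambda j: smi(S, A | {j}, Q))
        A.add(best)
        rest.remove(best)
    return A
```

Because SMI scores a whole set against another set rather than one pair at a time, pairwise contrastive objectives fall out as the special case where both sets are singletons.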
Under the Hood: Models, Datasets, & Benchmarks
Recent research is not just about new ideas; it’s also about building the foundational blocks – the models, datasets, and benchmarks that drive progress. Here’s a quick look at some key resources:
- Shodh-MoE: A sparse Mixture-of-Experts architecture for multi-physics foundation models, using a physics-informed autoencoder with Helmholtz-style velocity parameterization to guarantee exact mass conservation. Pretrained on The Well, Poseidon, PhysiX, and Walrus datasets. (arXiv:2605.15179)
- Evo-Depth: A lightweight VLA framework (0.9B parameters) using an Implicit Depth Encoding Module (IDEM) and Spatial Enhancement Module (SEM). Evaluated on Meta-World, VLA-Arena, LIBERO, and LIBERO-Plus benchmarks. Code available at https://github.com/MINT-SJTU/Evo-Depth.
- KGPFN: A knowledge graph foundation model leveraging Prior-data Fitted Networks for in-context learning. Achieves SOTA on 57 KG benchmarks. Code: https://github.com/HKUST-KnowComp/KGPFN.
- FactoryNet: The first universal pretraining corpus for industrial time-series data, with 51M datapoints across 23k task executions on six embodiments, following an S-E-F-C schema. Dataset and code: https://huggingface.co/datasets/factorynet/factorynet and https://github.com/factorynet0/FactoryNet.
- NeuroAtlas: The largest EEG benchmark (42 datasets, ~260k hours) for evaluating EEG foundation models across seizure detection, sleep staging, BCI, and brain age prediction. Highlights that EEG-specific FMs don’t consistently outperform general time-series FMs. (arxiv.org/pdf/2605.14698)
- TabPFN-3: A tabular data foundation model scaling to 1M training rows, 20x faster than its predecessor, with an attention-based many-class decoder and “Thinking mode” for test-time compute scaling. Code: https://docs.priorlabs.ai/ and https://github.com/PriorLabs/tabpfn-extensions/tree/main/src/tabpfn_extensions/many_class.
- Pan-FM: A pan-organ foundation model for learning cross-organ representations from 7 organ systems in the UK Biobank, addressing dominant-organ shortcut learning via Saliency-Guided Masking (SGM). (arxiv.org/pdf/2605.07055)
- EHR-RAGp: A retrieval-augmented foundation model for EHRs using prototype-guided retrieval, trained on MIMIC-IV. Code: https://github.com/nyuad-cai/EHR-RAGp.
- Agentick: A unified benchmark for sequential decision-making agents (RL, LLM, VLM, hybrid, human) with 37 procedurally generated tasks and 5 observation modalities. Leaderboard: https://roger-creus.github.io/agentick/board/.
- ISOMORPH: The first public digital twin of a multi-echelon logistics network for time-series forecasting benchmarks. Code: https://github.com/tuhinsahai/ISOMORPH.
- HumanNet: A one-million-hour human-centric video corpus for embodied AI, spanning first-person and third-person perspectives, with motion descriptions and hand/body signals. Code: https://github.com/DAGroup-PKU/HumanNet/.
Impact & The Road Ahead
The impact of these advancements is far-reaching. The development of robust, generalizable foundation models is poised to accelerate scientific discovery, enable more sophisticated and reliable robotic systems, and usher in a new era of personalized healthcare. For example, DT-Transformer: A Foundation Model for Disease Trajectory Prediction on a Real-world Health System from Brigham and Women’s Hospital demonstrates strong next-event prediction on real-world EHR data, hinting at the future of clinical forecasting. Similarly, CORTEG: Foundation Models Enable Cross-Modality Representation Transfer from Scalp to Intracranial Brain Recordings shows how scalp-EEG FMs can be adapted to ECoG, reducing calibration time for brain-computer interfaces to minutes.
The push for efficient adaptation and deployment is also crucial. Papers like LoREnc: Low-Rank Encryption for Securing Foundation Models and LoRA Adapters introduce training-free security frameworks, enabling secure on-device AI with minimal overhead. The vision of an Intelligence Delivery Network: Toward an Internet Architecture for the AI Age proposed by Hanling Wang et al. from Pengcheng Laboratory, Shenzhen, suggests a future where AI capabilities are network services, intelligently distributed across cloud, edge, and local environments, transforming infrastructure itself.
However, challenges remain. The call for standardization and interpretability is loud. No One Knows the State of the Art in Geospatial Foundation Models critically audits the GFM literature, revealing a lack of shared benchmarks and significant discrepancies in reported performance, making true progress hard to gauge. Initiatives like AI Harness Engineering as outlined in AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents by Hailin Zhong and Shengxin Zhu argue for formalized runtime substrates to ensure verifiable, auditable software agents, bridging the ‘autonomy gap’ from code generation to robust software development.
The horizon is full of potential: from agentic hydrologic intelligence via sensor-specialized Mini-JEPAs (Mini-JEPA Foundation Model Fleet Enables Agentic Hydrologic Intelligence) to unified multi-task brain decoding using LLM backbones (UniMind: Unleashing the Power of LLMs for Unified Multi-Task Brain Decoding). The journey toward truly generalist, reliable, and interpretable AI is complex, but these breakthroughs show that foundation models are laying the groundwork for an intelligent future across all domains.