Sample Efficiency: Unlocking the Future of AI with Less Data, More Impact
Latest 50 papers on sample efficiency: Feb. 14, 2026
The quest for greater sample efficiency remains a central challenge in modern AI/ML. As models grow larger and tasks more complex, the vast amounts of data required for training and generalization become a bottleneck. Researchers are actively pursuing ways to reduce this data dependency, making AI more accessible, sustainable, and capable in real-world, data-scarce settings. This digest covers recent breakthroughs, showing how fields from robotics to molecular design are learning to do more with less.
The Big Ideas & Core Innovations
At the heart of these advancements lies a collective effort to imbue AI systems with smarter learning mechanisms, leveraging structured knowledge, advanced model architectures, and novel optimization techniques. Many papers focus on reinforcement learning (RL), a notoriously data-hungry field. For instance, Optimistic World Models: Efficient Exploration in Model-Based Deep Reinforcement Learning by Akshay Mete et al. from Texas A&M University introduces Optimistic World Models (OWMs), which integrate an optimistic dynamics loss into world models to dramatically improve exploration in sparse-reward environments. Similarly, ExO-PPO: an Extended Off-policy Proximal Policy Optimization Algorithm by Hanyong Wang and Menglong Yang from Sichuan University combines on-policy and off-policy methods in a new PPO variant that boosts sample efficiency while maintaining stability.
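The optimistic dynamics loss in OWMs is specific to that paper; as a rough illustration of the underlying optimism-in-the-face-of-uncertainty principle, the sketch below adds an ensemble-disagreement bonus to imagined rewards so the agent is drawn toward regions its world model understands poorly. All class and function names here are hypothetical, and this is not the authors' implementation.

```python
# Illustrative sketch only: a generic optimism-under-uncertainty bonus for
# model-based RL, not the exact OWM loss. All names are hypothetical.
import torch
import torch.nn as nn

class DynamicsEnsemble(nn.Module):
    """K independent one-step dynamics heads over a shared latent state."""
    def __init__(self, state_dim: int, action_dim: int, k: int = 5, hidden: int = 128):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, state_dim),
            ) for _ in range(k)
        )

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        preds = torch.stack([h(x) for h in self.heads])   # (K, B, state_dim)
        # Mean prediction plus a scalar disagreement score per state-action pair.
        return preds.mean(0), preds.var(0).mean(-1)

def optimistic_reward(extrinsic_reward, disagreement, beta: float = 1.0):
    # Inflate imagined rewards where the world model is uncertain,
    # steering exploration toward under-visited regions.
    return extrinsic_reward + beta * disagreement

# Toy usage on dummy tensors
model = DynamicsEnsemble(state_dim=8, action_dim=2)
s, a, r = torch.randn(32, 8), torch.randn(32, 2), torch.randn(32)
next_mean, dis = model(s, a)
r_optimistic = optimistic_reward(r, dis)
```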
Robotics and control systems are major beneficiaries of these innovations. The paper Accelerating Robotic Reinforcement Learning with Agent Guidance by Yijie Guo et al. (University of Washington, UC Berkeley, ETH Zurich) presents AGPS, which automates the supervision pipeline in robotic RL, bridging the sample-efficiency gap between human-in-the-loop and fully automated methods. Further, JEPA-VLA: Video Predictive Embedding is Needed for VLA Models by Shangchen Miao et al. from Tsinghua University and Huawei Noah’s Ark Lab highlights the crucial role of video-based predictive embeddings such as V-JEPA 2 in enhancing environment understanding and policy priors in vision-language-action (VLA) models, directly addressing poor generalization and low sample efficiency.
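To make the JEPA-VLA pattern concrete, here is a hedged sketch of the general recipe: a frozen, pretrained video predictive encoder supplies perception features to a small trainable vision-language-action head. The encoder below is a toy stand-in rather than the real V-JEPA 2 API, and every dimension and name is illustrative.

```python
# Minimal sketch of reusing a frozen video predictive encoder as a policy prior.
# The encoder class is a placeholder, not the actual V-JEPA 2 interface.
import torch
import torch.nn as nn

class FrozenVideoEncoder(nn.Module):
    """Stand-in for a pretrained predictive video encoder (kept frozen)."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(3 * 16 * 64 * 64, embed_dim)  # toy projection
        for p in self.parameters():
            p.requires_grad = False                          # freeze the prior

    def forward(self, clip):                                 # clip: (B, 3, 16, 64, 64)
        return self.proj(clip.flatten(1))

class VLAPolicyHead(nn.Module):
    """Small trainable head mapping video + language embeddings to actions."""
    def __init__(self, embed_dim: int = 256, lang_dim: int = 128, action_dim: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + lang_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, video_emb, lang_emb):
        return self.mlp(torch.cat([video_emb, lang_emb], dim=-1))

encoder, head = FrozenVideoEncoder(), VLAPolicyHead()
clip, lang = torch.randn(4, 3, 16, 64, 64), torch.randn(4, 128)
actions = head(encoder(clip), lang)                          # (4, 7)
```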
Beyond RL, other domains are pushing the boundaries as well. In molecular optimization, Sample Efficient Generative Molecular Optimization with Joint Self-Improvement by Serra Korkmaz et al. from Helmholtz Zentrum München and the Technical University of Munich introduces JOINT SELF-IMPROVEMENT, a framework that mitigates the distribution shifts and high-variance updates of RL-based methods by pairing a joint generative-predictive model with self-improving sampling, yielding superior performance under limited evaluation budgets. For large language models, Reinforcement Learning with Promising Tokens for Large Language Models by Jing-Cheng Pang et al. from Huawei Technologies introduces RLPT, which improves efficiency and stability by focusing policy optimization on a subset of high-likelihood tokens, effectively reducing gradient variance.
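The "promising tokens" idea lends itself to a compact sketch: restrict the policy-gradient loss to tokens the current policy already assigns high probability, so low-likelihood tokens do not inject gradient noise. The probability-threshold rule below is an illustrative choice of ours, not necessarily RLPT's exact criterion.

```python
# Hedged sketch of masking low-likelihood tokens out of a policy-gradient loss.
# The threshold rule is illustrative, not RLPT's exact selection criterion.
import torch
import torch.nn.functional as F

def promising_token_pg_loss(logits, tokens, advantages, prob_threshold: float = 0.5):
    """
    logits:     (B, T, V) policy logits over the vocabulary
    tokens:     (B, T)    sampled token ids
    advantages: (B,)      per-sequence advantage estimates
    """
    logprobs = F.log_softmax(logits, dim=-1)
    token_lp = logprobs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)   # (B, T)
    # Keep only "promising" tokens whose likelihood exceeds the threshold.
    mask = token_lp.exp() > prob_threshold
    weighted = -(advantages.unsqueeze(1) * token_lp) * mask
    return weighted.sum() / mask.sum().clamp(min=1)

# Toy usage
logits = torch.randn(2, 5, 100, requires_grad=True)
tokens = torch.randint(0, 100, (2, 5))
adv = torch.tensor([0.7, -0.3])
loss = promising_token_pg_loss(logits, tokens, adv)
loss.backward()
```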
Under the Hood: Models, Datasets, & Benchmarks
These research efforts introduce and leverage a variety of models, datasets, and benchmarks to achieve their impressive gains:
- Optimistic World Models (OWMs): A plug-and-play framework built on existing world models (e.g., DreamerV3) and evaluated on challenging sparse-reward environments. Code: https://github.com/weipu-zhang/STORM
- AGPS (Agent-Guided Policy Synthesis): Tested on robotic reinforcement learning tasks, demonstrating improved sample efficiency over traditional Human-in-the-Loop (HIL) methods. Code: https://agps-rl.github.io/agps
- JEPA-VLA: Integrates video-based predictive representations (specifically V-JEPA 2) into existing VLA models, overcoming the limitations of image- or language-image-based visual representations and enhancing performance across robotics benchmarks and real-world tasks.
- JOINT SELF-IMPROVEMENT: Utilizes a joint generative-predictive model for sample-efficient molecular optimization, outperforming state-of-the-art methods on both offline and online molecular optimization benchmarks. Code: https://github.com/schwallergroup/
- RLPT: Enhances policy optimization for Large Language Models (LLMs) like Qwen series models, demonstrating superior performance on mathematical reasoning (GSM8K), coding (HumanEval), and general instruction-following (AlpacaEval) tasks. Code: https://github.com/huggingface/open-r1
- Fun-DDPS: A generative framework combining function-space diffusion models with neural operator surrogates for carbon capture and storage (CCS) modeling, achieving robust forward modeling with limited data. Resources: https://arxiv.org/pdf/2602.12274
- Constrained Initial Representations (CIR): A framework for Temporal Difference Learning that uses the Tanh function to constrain initial representations, achieving strong empirical performance on numerous continuous control tasks. Resources: https://arxiv.org/pdf/2602.11800
- FlowAdapt: A parameter-efficient domain adaptation framework based on optimal transport theory, achieving state-of-the-art performance on collaborative perception benchmarks with only 1% trainable parameters. Resources: https://arxiv.org/pdf/2602.11565
- Exact Posteriors with Normalizing Flows: An oracle framework using class-conditional normalizing flows to decompose neural network error, revealing scaling laws beyond the loss curve on datasets like AFHQ and ImageNet. Code: https://github.com/TarFlow/TarFlow
- Hierarchical Goal-Conditioned RL via Normalizing Flows: A data-efficient framework for hierarchical goal-conditioned reinforcement learning, useful in scenarios with sparse or missing reward/action information. Code: https://github.com/Shaswat2001/heirarchical_RL
- Neuro-symbolic Action Masking (NSAM): Integrates symbolic reasoning using Probabilistic Sentential Decision Diagrams (PSDDs) into deep reinforcement learning, improving sample efficiency and reducing constraint violations; a minimal action-masking sketch follows this list. Code: https://github.com/shan0126/NSRL
- MOBONS (Multi-Objective Bayesian Optimization for Networked Black-Box Systems): Extends Bayesian optimization for networked black-box systems, using graph-based representations for efficient multi-objective optimization. Code: https://github.com/PaulsonLab/MOBONS
- ARO (Adaptively Rotated Optimization): A new matrix optimization paradigm based on gradient rotation, outperforming AdamW by up to 1.35x in large model pretraining. Resources: https://arxiv.org/pdf/2602.09006
- AT-GRPO (Adaptive Tree-based Group Relative Policy Optimization): Part of a dialogue agent framework, addressing short-horizon biases and improving sample efficiency in long-horizon RL for dialogue. Resources: https://arxiv.org/pdf/2602.08533
- Octopus: An RL rollout augmentation framework for Vision-Language Models (VLMs) that teaches self-correction, achieving state-of-the-art performance with reduced training time. Code: https://dripnowhy.github.io/Octopus/
- CBS (Contextual Rollout Bandits): A scheduler for Reinforcement Learning with Verifiable Rewards (RLVR), improving training efficiency and performance through noise-aware intra-group selection and adaptive global reuse of rollouts. Code: https://github.com/buaa-cbs/CBS
- Laplacian Keyboard (LK): A hierarchical framework using graph Laplacian eigenvectors as a reward basis for sample-efficient RL across tasks. Resources: https://arxiv.org/pdf/2602.07730
- Sequence-to-Sequence Models for Log Parsing: Evaluates Transformers and Mamba models, finding Transformers reduce parsing error by up to 23.4%, while Mamba offers competitive accuracy at lower computational cost. Resources: https://arxiv.org/pdf/2602.07698
- Continuous Program Search: Improves genetic programming efficiency by aligning mutation operators with latent behavioral geometry, demonstrating significant reductions in evaluation budget. Resources: https://doi.org/10.1145/nnnnnnn.nnnnnnn
- Language Bottleneck Models (LBMs): Uses natural language as an interpretable intermediate representation for knowledge state modeling, outperforming traditional models on three datasets. Resources: https://arxiv.org/pdf/2506.16982
- Deep Meta Coordination Graphs (DMCG): Leverages dynamic coordination graphs and graph convolutional networks for cooperative multi-agent RL, achieving state-of-the-art performance. Code: https://github.com/Nikunj-Gupta/dmcg-marl
- Aurora: A unified training-serving system integrating speculative decoding with reinforcement learning for LLMs, achieving 1.5x speedups on frontier models. Resources: https://aurora-spec-ai.github.io/
- JBR (Joint Experience Best Response): An efficient modification to PSRO that reuses shared experience across agents in multi-agent RL, enhancing sample efficiency and strategic robustness. Resources: https://arxiv.org/pdf/2602.06599
- Progress Constraints for RL in Behavior Trees: Introduces constraints to guide RL exploration within behavior trees, leading to faster convergence in complex environments. Resources: https://arxiv.org/pdf/2602.06525
- Coupled Local and Global World Models: A framework for efficient first-order RL, demonstrating improved sample efficiency and performance. Code: https://github.com/your-organization/coupled-world-models
- Stochastic Hierarchical Data-Driven Optimization: Applies Sloppy Model theory to efficiently calibrate physical models, outperforming traditional optimization techniques in sample efficiency for plasma-surface kinetics. Code: https://github.com/scikit-optimize/
- Differential RL (dfPO): A novel framework that reformulates RL through continuous-time control, with pointwise convergence guarantees and competitive regret bounds for scientific computing tasks. Code: https://github.com/mpnguyen2/dfPO
- Stochastic Decision Horizons (SDH): A novel approach to constrained RL using survival-weighted objectives for efficient off-policy learning while maintaining constraint compliance. Code: https://github.com/hyfydy/hyfydy
- Pruning for Generalization: A transfer-oriented spatiotemporal graph framework that improves model generalization through novel pruning techniques for spatiotemporal data. Resources: https://arxiv.org/pdf/2602.04153
- Off-Policy Log-Dispersion Regularization (LDR): A regularization framework for training Boltzmann generators, improving data efficiency by up to one order of magnitude in sampling from unnormalized probability densities. Resources: https://arxiv.org/pdf/2602.03729
- Variance-Reduced MPPI: Leverages quadratic approximations of system dynamics to improve sample efficiency and stability in trajectory optimization for robotic systems. Code: https://github.com/MarcToussaint/robotic
- Reparameterization Flow Policy Optimization (RFO): Integrates flow-based policies with reparameterization policy gradients for high sample efficiency in robotic control tasks. Code: https://github.com/rewarped/rewarped
- GFlowPO: A probabilistic framework combining off-policy GFlowNet training with dynamic meta-prompt updates for sample-efficient prompt optimization in LLMs. Resources: https://arxiv.org/pdf/2602.03358
- Information-Theoretic Multi-Model Fusion: An adaptive sampling framework for materials design, redefining optimization as trajectory discovery and leveraging multi-model fusion for target-oriented search. Resources: https://arxiv.org/pdf/2602.03319
- SLOPE (Shaping Landscapes with Optimistic Potential Estimates): A framework for model-based RL that transforms sparse reward modeling into informative potential landscapes, enabling efficient planning and exploration. Resources: https://arxiv.org/pdf/2602.03201
- GCP (Graph of Concept Predictors): An active distillation framework that externalizes LLM reasoning into a graph, enabling interpretable and efficient training for discriminative models. Code: https://github.com/Ziyang-Yu/GCP
- GCR-RL: A reinforcement learning framework that enforces geometric coherence in value functions using order theory, improving sample efficiency and stability. Resources: https://arxiv.org/pdf/2602.02978
- MADT (Multi-Agent Decision Transformer): Reformulates traffic signal control as a sequence modeling problem, achieving state-of-the-art performance in traffic coordination. Resources: https://arxiv.org/pdf/2602.02903
- IDM-based Policies: Utilizes Inverse Dynamics Models (IDMs) to significantly improve sample efficiency in semi-supervised imitation learning compared to behavior cloning. Code: https://github.com/zuoxingdong/mazelab
- EchoJEPA: A latent predictive foundation model for echocardiography, trained on 18 million videos, outperforming existing methods in LVEF estimation and RVSP prediction. Code: https://github.com/bowang-lab/EchoJEPA
- CADENT: A hybrid distillation framework that unifies strategic and tactical knowledge for sample-efficient transfer in reinforcement learning, achieving 40-60% better performance than baselines. Resources: https://arxiv.org/pdf/2602.02532
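Several of the items above rest on simple, reusable mechanisms. As one example, here is a minimal sketch of the action-masking pattern behind NSAM: actions a symbolic layer rejects are removed from the policy's distribution before sampling. The `allowed` mask below is a placeholder for the PSDD-based check, and the code is an illustration rather than the authors' implementation.

```python
# Minimal action-masking sketch: constraint-violating actions are given -inf
# logits so the policy can never sample them. The `allowed` mask is a stand-in
# for a symbolic (e.g. PSDD-based) feasibility check.
import torch

def masked_action_sample(logits, allowed_mask):
    """
    logits:       (B, A) policy logits over discrete actions
    allowed_mask: (B, A) boolean, True where the symbolic layer permits the action
    """
    masked = logits.masked_fill(~allowed_mask, float("-inf"))
    dist = torch.distributions.Categorical(logits=masked)
    return dist.sample()

# Toy usage: forbid action 0 everywhere and confirm it is never sampled.
logits = torch.randn(4, 6)
allowed = torch.ones(4, 6, dtype=torch.bool)
allowed[:, 0] = False
actions = masked_action_sample(logits, allowed)
assert (actions != 0).all()
```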
Impact & The Road Ahead
These advancements in sample efficiency are not merely incremental improvements; they are foundational shifts that promise to unlock new frontiers for AI. By reducing the data burden, we can accelerate research in data-scarce domains like materials science (as shown by Technical University of Darmstadt’s work on Information-Theoretic Multi-Model Fusion for Target-Oriented Adaptive Sampling in Materials Design) and expand the deployment of complex AI systems in real-world scenarios, such as robotic manipulation of deformable objects (e.g., Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows from Stanford University et al.) and sustainable industrial processes (Multi-Objective Bayesian Optimization for Networked Black-Box Systems by Akshay Kudva et al. from The Ohio State University).
Learning from fewer samples also has significant implications for ethical AI: it reduces the carbon footprint of training large models and fosters more equitable access to powerful AI technologies. The work on Language Bottleneck Models (Language Bottleneck Models for Qualitative Knowledge State Modeling by Antonin Berthon and Mihaela van der Schaar from the University of Cambridge) offers a glimpse into how sample-efficient, interpretable models can revolutionize personalized education. Furthermore, the theoretical insights from papers like When Do Multi-Agent Systems Outperform? Analysing the Learning Efficiency of Agentic Systems by Junwei Su and Chuan Wu from the University of Hong Kong are critical for guiding future development, ensuring that new architectures and algorithms are designed with efficiency in mind.
The road ahead involves continuous exploration of hybrid approaches, combining model-based and model-free methods, leveraging symbolic reasoning with deep learning, and integrating advanced optimization techniques across diverse problem spaces. As we continue to refine these methods, the future of AI will be characterized by intelligence that is not only powerful but also remarkably efficient and adaptable. The era of “more with less” is truly upon us, and the breakthroughs highlighted here are paving the way.