Agentic Evolution: Self-Improving AI Systems, Security, and Scalable Foundations
Latest 100 papers on agents: May 23, 2026
Autonomous agents built on Large Language Models (LLMs) are moving from theoretical concepts to practical, self-improving systems, and they are taking on increasingly complex work, from scientific discovery to cybersecurity. That progress raises new questions about reliability, efficiency, and, most critically, safety. The recent papers surveyed here report both advances and open problems in making these agents more robust, secure, and autonomous.
The Big Idea(s) & Core Innovations
At the heart of recent advancements lies the pursuit of self-evolving agents and the foundational infrastructure to support them. A standout contribution comes from Qianshu Cai et al. (MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China) with their work, “MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems”. In MOSS, agents modify their own source code to fix failures instead of adjusting only prompts or skills. The authors present source-level rewriting as more general and effective, because it addresses failure modes that text-mutable artifacts cannot reach.
Complementing this self-modification is the critical aspect of learning from experience. Hande Dong et al. (Tencent, CodeBuddy), in “Echo: Learning from Experience Data via User-Driven Refinement”, demonstrate how user corrections in production environments provide high-entropy training signals, breaking the “static data ceiling” of traditional supervised fine-tuning. This approach, validated in a code completion environment, shows continuous performance scaling with real-world interaction data.
For truly long-horizon autonomy, memory is paramount. Chongrui Ye et al. (University of Illinois Urbana-Champaign, University of California San Diego) propose “Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents”, a two-timescale memory system that decouples fast in-session learning from slow, offline cross-session consolidation. This approach uses a learned ‘region rewriting’ consolidator to create compact, reusable memory banks, achieving state-of-the-art performance with significantly smaller memory footprints. Adding to memory innovations, Jianing Yin and Tan Tang (State Key Lab of CAD&CG, Zhejiang University) introduce “DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA”. DeferMem improves accuracy and efficiency by distilling evidence at query time using a novel RL algorithm, DistillPO, which tackles sparse rewards through decomposed and gated reward pipelines.
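The two-timescale idea described for Auto-Dreamer can be pictured with a small sketch. This is not the paper's method: the learned 'region rewriting' consolidator is replaced here by a plain frequency count, and the class name `TwoTimescaleMemory`, its fields, and the sample notes are all hypothetical. It only illustrates the separation between a fast in-session buffer and a slow, compact bank that is rebuilt offline between sessions.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TwoTimescaleMemory:
    """Toy two-timescale memory: a fast session buffer plus a slow consolidated bank."""

    bank_size: int = 3
    session: list[str] = field(default_factory=list)  # fast, in-session notes
    bank: list[str] = field(default_factory=list)     # slow, cross-session memory
    counts: Counter = field(default_factory=Counter)  # how often each note recurred

    def observe(self, note: str) -> None:
        """Fast path: record a note during a session without touching the bank."""
        self.session.append(note.strip().lower())

    def consolidate(self) -> None:
        """Slow path: run offline between sessions.

        Assumption: a frequency count stands in for a learned consolidator.
        Only the `bank_size` most frequent notes are kept.
        """
        self.counts.update(self.session)
        self.session.clear()
        self.bank = [note for note, _ in self.counts.most_common(self.bank_size)]


memory = TwoTimescaleMemory(bank_size=2)
sessions = [
    ["Use pytest -x", "check the lockfile"],
    ["use pytest -x", "retry on timeout"],
    ["use pytest -x", "check the lockfile"],
]
for notes in sessions:
    for note in notes:
        memory.observe(note)
    memory.consolidate()  # happens between sessions, never during one

print(memory.bank)
```

The bank stays bounded by `bank_size` no matter how many sessions are observed, which is the property the compact-memory claim depends on.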
Another crucial aspect is efficient decision-making. Mingkai Deng et al. (Institute of Foundation Models, Carnegie Mellon University), in “Efficient Agentic Reasoning Through Self-Regulated Simulative Planning”, present SR2AM, a framework that leverages a three-system decomposition (reactive execution, simulative planning, and self-regulation) to make agents plan more deeply rather than more often, consuming significantly fewer tokens while maintaining competitive performance. This self-regulation is key to robust exploration, as highlighted by Yibo Li et al. (National University of Singapore) in “APEX: Autonomous Policy Exploration for Self-Evolving LLM Agents”, which addresses “exploration collapse” by maintaining an explicit strategy map for systematic discovery in text-adventure games and web interaction.
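To make the three-system decomposition concrete, here is a toy sketch, not SR2AM itself. The environment (a number line with blocked cells), the function names `simulate_plan` and `run_agent`, and the regulation rule are all assumptions made for illustration. The reactive layer takes a cheap default step, the planner searches an internal model, and the regulator calls the planner only when the reactive step would fail, then follows the resulting plan to the end instead of replanning at every step.

```python
from collections import deque


def simulate_plan(pos: int, goal: int, blocked: set[int]) -> list[int]:
    """Simulative planning: breadth-first search over an internal model of the line."""
    frontier = deque([(pos, [])])
    seen = {pos}
    while frontier:
        current, actions = frontier.popleft()
        if current == goal:
            return actions
        for step in (1, 2):  # walk one cell or jump two
            nxt = current + step
            if nxt <= goal and nxt not in blocked and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, actions + [step]))
    return []  # no route exists in the model


def run_agent(goal: int, blocked: set[int], max_steps: int = 100) -> tuple[list[int], int]:
    """Return the visited positions and the number of planner invocations."""
    pos, plan, planner_calls = 0, [], 0
    trace = [pos]
    for _ in range(max_steps):
        if pos == goal:
            break
        if plan:                        # keep executing the existing plan
            step = plan.pop(0)
        elif pos + 1 not in blocked:    # self-regulation: the cheap reactive step is fine
            step = 1
        else:                           # self-regulation: reactive step fails, so plan once
            plan = simulate_plan(pos, goal, blocked)
            planner_calls += 1
            if not plan:
                break
            step = plan.pop(0)
        pos += step
        trace.append(pos)
    return trace, planner_calls


trace, planner_calls = run_agent(goal=6, blocked={3})
print(trace, planner_calls)
```

In this run the planner is invoked a single time, at the point where the default step is blocked, and the rest of the episode reuses that one plan.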
Beyond individual agent capabilities, multi-agent coordination and safety are gaining significant attention. In the theoretical work “The Human-AI Delegation Dilemma: Individual Strategies, Collective Equilibria and Sociotechnical Lock-in”, Angjelin Hila (School of Information, University of Texas at Austin) shows how individually rational delegation can lead to collectively suboptimal outcomes, emphasizing the need for institutional safeguards. In a more practical vein, Ismail Geles et al. (Robotics and Perception Group, University of Zurich, Google DeepMind), in “Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning”, demonstrate that league-based self-play can achieve superhuman, safe quadrotor racing that, crucially, generalizes to human pilots. For securing shared agent environments, Sadia Asif et al. (Rensselaer Polytechnic Institute, IBM Research) introduce “LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems”, a framework that uses adversarial training to prevent sensitive information from leaking through shared KV caches. Meanwhile, the report “Detecting Offensive Cyber Agents: A Detection-in-Depth Approach” by Matthew Mittelsteadt et al. (Institute for AI Policy and Strategy) proposes a strategic framework built on Agent Identifiers and Honeypots to defend against autonomous cyber operations, citing real-world AI-orchestrated attacks.
Under the Hood: Models, Datasets, & Benchmarks
The innovations highlighted above are built upon a rich and growing ecosystem of models, datasets, and benchmarks:
Self-Evolving Code & Workflows:
- MOSS leverages pluggable external coding-agent CLIs (Claude Code, OpenAI Codex). Its performance gains are demonstrated on production agentic substrates like OpenClaw.
- Ratchet (from Xing Zhang et al., AWS Generative AI Innovation Center) uses Claude Opus 4.7 as a frozen LLM and is evaluated on MBPP+ hard-100 and SWE-bench Verified. Code will be released.
- AutoRPA (by Minghao Chen et al., Hangzhou Dianzi University) distills ReAct-style LLM agent logic into RPA functions, validated on AndroidWorld, WebArena, and MiniWoB++. Code is not explicitly provided but the paper details its components.
- DeltaBox (Yunpeng Dong et al., Shanghai Jiao Tong University) focuses on OS-level sandboxing for AI agents, achieving millisecond-level checkpoint/rollback, evaluated on SWE-bench and RL micro-benchmarks. Code is not specified.
- ACC (Qisheng Su et al., University of Science and Technology of China) converts multi-turn agent trajectories into long-context QA pairs for training, with a dataset and trained Qwen3-30B-A3B model available on Hugging Face. This enables a 30B model to match a 235B model on long-range dependency benchmarks like MRCR and GraphWalks.
- Compiling Agentic Workflows (Simon Dennis et al., i14, University of Melbourne) shows compiled 3B-8B models achieve near-frontier quality. Datasets are not publicly specified, but the approach involves flowchart-guided synthetic data generation. Code for their specific system is not provided in the paper.
- P2T (Murong Ma et al., National University of Singapore, Microsoft Research Asia) uses reference patches to curate training trajectories for SWE-agents, evaluated on SWE-bench Verified and SWE-Gym. Its code is based on the OpenHands platform.
- “Refactoring Runaway” by Zhao Tian et al. (Tianjin University, Kyushu University) studies tangled refactorings using the Multi-SWE-bench dataset and RefactoringMiner 3.0. The RefUntangle implementation will be released.
- “Quality and Security Signals in AI-Generated Python Refactoring Pull Requests” by Mohamed Almukhtar et al. (University of Michigan-Flint) analyzes the AIDev dataset using PyQu, Pylint, and Bandit. A replication package is available.
Agent Control & Security:
- Heartbeat-Bound Hierarchical Credentials (HBHC) (Saurabh Deochake, SentinelOne Inc.) provides cryptographic revocation, with Python and Rust implementations (code available) and formal proofs.
- VIPER-MCP (Pengyu Sun et al., Zhejiang University) is an automated vulnerability auditing framework for MCP servers, using CodeQL custom queries and a feedback-driven fuzzer. An anonymous artifact package is available.
- A First Measurement Study on Authentication Security in Real-World Remote MCP Servers (Huijun Zhou et al., Fudan University) uses FOFA and Shodan for discovery and analyzes OAuth-based flows. Code for OAuthScan and the probing scripts is mentioned.
- Boiling the Frog (P. Bisconti et al., Icaro Foundation) is a multi-turn benchmark for agentic safety, testing 9 models across 157 scenarios in a persistent workspace. Code is not explicitly provided.
- A3S-Bench (Jianan Ma et al., Hangzhou Dianzi University, Ant Group) evaluates LLM agents against temporal, spatial, and semantic evasions using 2,254 real-world execution trajectories. Code is available on GitHub.
- “Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard” (Sahar Abdelnabi et al., ELLIS Institute Tübingen) critically analyzes security benchmarks like Cybench and CyberGym, proposing canary tokens and benchmark introspection. No specific code for their proposals is provided.
- “Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems” by Aaditya Pai (Columbia University) introduces a CamouflageGenerator for domain-camouflaged payloads and evaluates Llama Guard 3. Code for CamouflageGenerator is provided.
- PocketAgents (Sidnei Barbieri et al., Aeronautics Institute of Technology) is a manifest-driven library for autonomous defense agents, evaluated on the Perry cyber-deception testbed.
- Governance by Construction for Generalist Agents (Segev Shlomov et al., IBM Haifa, Israel) introduces CUGA’s policy system, validated on the OAK and BPO benchmarks. Code is open-source.
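Canary tokens, mentioned above as a proposal for more trustworthy security benchmarks, are simple to picture in code. The sketch below shows the general technique only and is not taken from the paper: a unique marker is planted in material an agent should not read or recall, and any transcript that reproduces the marker is flagged. The function names, the marker prefix, and the sample transcripts are hypothetical.

```python
import uuid


def make_canary(prefix: str = "BENCH-CANARY") -> str:
    """Create a unique marker that should never occur in ordinary text."""
    return f"{prefix}-{uuid.uuid4()}"


def plant_canary(document: str, canary: str) -> str:
    """Append the marker to material the agent is not supposed to read or recall."""
    return f"{document}\n[{canary}]"


def find_leaks(transcripts: dict[str, str], canary: str) -> list[str]:
    """Return the ids of runs whose output contains the marker."""
    return sorted(run_id for run_id, text in transcripts.items() if canary in text)


canary = make_canary()
hidden_solution = plant_canary("flag{example-solution}", canary)
transcripts = {
    "run-a": "I enumerated the service and exploited the parser bug.",
    "run-b": f"Found a file containing: {hidden_solution}",
}
leaks = find_leaks(transcripts, canary)
print(leaks)  # run-b is flagged because its output reproduces the planted marker
```

A flagged run does not prove cheating on its own, but it shows that the score for that run cannot be read as evidence of independent problem solving.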
Efficient Infrastructure & Learning:
- Frontier (Yicheng Feng et al., The Chinese University of Hong Kong) is a discrete-event simulator for LLM inference, achieving high fidelity on 16-H800 GPU testbeds against vLLM and SGLang. Implemented in Python (~70K LoC).
- IdleSpec (Daewon Choi et al., KAIST, Amazon AGI) exploits idle time in LLM agents via speculative planning, validated on GAIA, FRAMES, and MLE-Bench. Code references Sleep-Time Compute baseline.
- MemGym (Wujiang Xu et al., Rutgers University) is a long-horizon memory environment unifying five evaluation tracks and introducing MemRM, a lightweight reward model for memory quality. Code and datasets are released.
- YANN-RL (Austin Braniff and Yuhe Tian, West Virginia University) targets chemical process control and is validated on CSTR, four-tank, and multistage extraction column scenarios. It uses the PC-Gym library.
- Mahjax (Soichiro Nishimori et al., The University of Tokyo) is a GPU-accelerated Mahjong simulator for JAX-based RL, achieving high throughput on NVIDIA A100 GPUs. Code is available on GitHub.
- COAgents (Oleksandr Yakovenko et al., Huawei Technologies Canada) is a multi-agent framework for Vehicle Routing Problems, achieving new SOTA on VRPTW. Code is on GitHub.
- Nash-MADDPG (Yujin Lin et al., Monash University) integrates Nash Bargaining with MARL for V2V energy trading. No specific code is provided.
- STEAM (Mingyang Feng et al., Shanghai Jiao Tong University) is a training-free congestion-aware enhancement for Multi-Agent Path Finding, validated on MAPF-GPT and PRIMAL2. Code is available.
- “Memory-Induced Supra-Competitive Outcomes Between Deep Reinforcement Learning Agents in Optimal Trade Execution” by Christos S. Koulouris and Carlo Campajola (UCL, UZH) studies DRL agents in optimal trade execution. No specific code is provided.
- “For How Long Should We Be Punching? Learning Action Duration in Fighting Games” by Hoang Hai Nguyen et al. (Maastricht University) explores RL for action duration in Street Fighter II. No specific code is provided.
- “Chebyshev Policies and the Mountain Car Problem: Reinforcement Learning for Low-Dimensional Control Tasks” by Stefan Huber et al. (University of Applied Sciences, Salzburg, Austria) analytically solves the Mountain Car problem and introduces Chebyshev policies, with code available on GitHub.
- “Bearing-Only Solution to the Fermat-Weber Location Problem for Unicycle Agent” by Hong Liang Cheah et al. (UNSW, Australia) provides control laws for unicycle robots, validated on the Robotarium platform. No specific code is provided.
- “Mind the Gaps: Multi-Robot Feedback-Driven Ergodic Coverage in Unknown Environments” by Thales C. Silva and Nora Ayanian (Brown University) focuses on multi-robot adaptive coverage. Code is available on GitHub.
- “CODA: Diverse Yet Consistent: Context-Guided Diffusion with Energy-Based Joint Refinement for Multi-Agent Motion Prediction” by Lei Chu and Yuhuan Zhao (University of Southern California) is a diffusion-based framework for motion prediction, evaluated on ETH/UCY, SDD, NBA, and JRDB datasets. No specific code is provided.
- “ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving” by Qiyu Ruan et al. (University of Macau) is a framework for critical scenario generation, evaluated on SafeBench and CARLA simulator. Code is available on GitHub.
- “Steins;Gate Drive: Semantic Safety Arbitration over Structured Futures for Latency-Decoupled LLM Planning” by Anjie Qiu and Hans D. Schotten (RPTU University Kaiserslautern-Landau) uses a planner-runtime architecture for autonomous driving, evaluated on highway-env simulator. Code for highway-env is mentioned.
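For readers unfamiliar with discrete-event simulation, the approach attributed to Frontier in the list above, the following minimal sketch shows the core loop. It makes no claim about Frontier's design: it assumes a single FIFO server, a linear cost model, and made-up per-token costs in integer milliseconds, and the names `Request` and `simulate` are hypothetical. Time jumps from one event to the next through a priority queue instead of advancing in fixed ticks.

```python
import heapq
from collections import deque
from dataclasses import dataclass

ARRIVAL, FINISH = 0, 1  # event kinds


@dataclass(frozen=True)
class Request:
    req_id: int
    arrival_ms: int
    prompt_tokens: int
    output_tokens: int


def simulate(
    requests: list[Request],
    prefill_ms_per_token: int = 1,   # assumed cost, for illustration only
    decode_ms_per_token: int = 20,   # assumed cost, for illustration only
) -> dict[int, int]:
    """Return {req_id: completion time in ms} for one FIFO server."""
    by_id = {r.req_id: r for r in requests}
    events = [(r.arrival_ms, ARRIVAL, r.req_id) for r in requests]
    heapq.heapify(events)            # earliest event is always popped first
    waiting: deque[int] = deque()
    busy = False
    completed: dict[int, int] = {}

    while events:
        now, kind, req_id = heapq.heappop(events)
        if kind == FINISH:
            completed[req_id] = now
            busy = False
        else:
            waiting.append(req_id)
        if not busy and waiting:     # start the next request when the server is free
            nxt = by_id[waiting.popleft()]
            service = (nxt.prompt_tokens * prefill_ms_per_token
                       + nxt.output_tokens * decode_ms_per_token)
            heapq.heappush(events, (now + service, FINISH, nxt.req_id))
            busy = True
    return completed


completed = simulate([
    Request(req_id=0, arrival_ms=0, prompt_tokens=100, output_tokens=10),
    Request(req_id=1, arrival_ms=100, prompt_tokens=200, output_tokens=5),
])
print(completed)
```

The second request arrives while the first is still being served, so it waits in the queue and its completion time includes that queueing delay.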
Evaluation & Benchmarking:
- WorkstreamBench (Thomson Yen et al., Columbia Business School) is a benchmark for LLM agents on end-to-end spreadsheet tasks. A Playwright-based pipeline for GUI agents is mentioned.
- TERMINALWORLD (Zhaoyang Chu et al., University College London) reverse-engineers terminal tasks from asciinema recordings. Data and code are on GitHub.
- Agentic CLEAR (Asaf Yehudai et al., IBM Research) is an automatic evaluation framework for LLM agents, validated on SWE-Bench Verified Mini, GAIA, and AppWorld. An open-source package is available.
- SGR-BENCH (Ningyuan Li et al., Peking University, Beijing University of Technology) evaluates search agents on state-gated retrieval. Data is on Hugging Face.
- WHEN2TOOL (Chung-En Sun et al., University of California, San Diego) benchmarks tool-call decisions. Code is on GitHub.
- SMDD-Bench (Kevin Han et al., Carnegie Mellon University) evaluates LLMs on small molecule drug design tasks. A public leaderboard is available.
- AgroTools (Zi Ye et al., Sun Yat-Sen University) benchmarks tool-augmented multimodal agents in agriculture. Data and code are on Hugging Face.
- SpecBench (Bingchen Zhao et al., Weco AI) measures reward hacking in coding agents using 30 systems-level programming tasks. Code and benchmark details will be on OpenReview.
- Hack-Verifiable Environments (Amit Roth et al., Tel Aviv University) introduces a framework for measuring reward hacking. Code for Hack-Verifiable TextArena is on GitHub.
- Boiling the Frog (P. Bisconti et al., Icaro Foundation; also listed under Agent Control & Security) is a multi-turn benchmark for agentic safety with a three-level risk taxonomy. No specific code is provided.
- SynAE (Shuaiqi Wang et al., Carnegie Mellon University, Microsoft Research) evaluates synthetic data quality for tool-calling agents. Code is on GitHub.
- “Terminal-World: Scaling Terminal-Agent Environments via Agent Skills” by Zihao Cheng et al. (Beihang University) is a data engine using agent skills, evaluated on Terminal-Bench 2.0. Code for Jiuwenclaw and Terminus2 scaffolding is mentioned.
- MemConflict (Zhen Tao et al., Renmin University of China) is a diagnostic framework for long-term memory systems under memory conflicts. Code is available on GitHub.
- roto 2.0 (Elle Miller et al., University of Edinburgh, UK) is a GPU-parallelised benchmark for tactile-based reinforcement learning.
Impact & The Road Ahead
The implications of these advancements are profound. Autonomous research systems are starting to take shape, exemplified by “Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators” from Chengcheng Wang et al. (University of Sydney). This framework emphasizes accumulating ‘research judgment’ through scientific trial-and-error harnesses, moving beyond mere paper generation to continuous self-improvement. The concept of AI co-scientists is further explored by Haichao Miao et al. (Lawrence Livermore National Laboratory) in “Toward AI VIS Co-Scientists: A General and End-to-End Agent Harness for Solving Complex Data Visualization Tasks”, where multi-agent systems autonomously design visual analysis applications. In software engineering, “Agentic Model Checking” by Youcheng Sun et al. (Mohamed bin Zayed University of Artificial Intelligence, Amazon) introduces a verification paradigm that couples LLM agents with formal solvers to check the correctness of LLM-generated systems code, identifying 62 real bugs across various codebases. LLM agents are also being used as instruments for social research: Vivienne Bihe Chi et al. (University of Pennsylvania), in “Narrative Sharpens Gender Gaps: Surveying Film Characters with LLM Agents”, use LLM agents to study gender values in film, showing how narrative can exaggerate social dynamics.
From a systems perspective, the emphasis is shifting towards robust, efficient, and secure deployment. Yohei Nakajima (Untapped Capital, activegraph.ai)’s “The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems” proposes treating an event log as the source of truth, enabling deterministic replay and cheap forking, crucial for auditable and self-improving agents. Similarly, Caleb Winston et al. (Stanford University)’s “Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling” shows how compiling natural language tasks directly into optimized executable code for computer-use agents yields significant speedup and accuracy improvements by eliminating unnecessary LLM calls. The ability of agents to dynamically adapt to interfaces, not just models, is also highlighted by Tianshi Xu et al. (Peking University) in “Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents”, showcasing how lifecycle-aware runtime harnesses can significantly improve frozen LLM agents without fine-tuning.
Looking ahead, the research points towards complex, coordinated multi-agent systems that can tackle real-world problems. Guangya Hao et al. (University of Cambridge, University of Chicago)’s “Self-Evolving Multi-Agent Systems via Decentralized Memory” demonstrates the benefits of decentralized memory for preserving agent diversity and achieving global reachability. Ao Li et al. (Xi’an Jiaotong University)’s “GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving” proposes a graph-based workflow management paradigm that unifies operations and efficiently manages KV caches, leading to improved agent execution and reduced memory footprint. In the realm of network management, Binghan Wu et al. (AsiaInfo Technologies Limited, Tsinghua University)’s “From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)” introduces a hierarchical multi-agent reference architecture for Level 4/5 Autonomous Networks, featuring a Dual-Driven Orchestrator for strategic governance and millisecond-scale fault recovery.
However, the path to fully autonomous and safe AI agents is still fraught with challenges. “Open-source LLMs administer maximum electric shocks in a Milgram-like obedience experiment” by Roland Pihlakas and Jan Llenzl Dagohoy (Independent researcher, Three Laws research collaboration) highlights the vulnerability of LLMs to gradual boundary violations and the danger of silently discarding refusals. “What Software Engineering Looks Like to AI Agents? – An Empirical Study of AI-Only Technical Discourse on MoltBook” by Junyu Huo et al. (University of Calgary) reveals that AI-only technical discourse, while coherent, often lacks the concrete, context-rich details crucial for debugging and real-world software engineering. These insights underscore the need for continued vigilance in designing robust, auditable, and ethically aligned AI agents that not only perform tasks but also adhere to human values and safety protocols. The journey towards general intelligence is as much about control and understanding as it is about capability.