Large Language Models: From Reasoning to Reliable Agents and Efficient Hardware
Latest 100 papers on large language models: Oct. 20, 2025
Large Language Models (LLMs) continue to astound us with their rapid advancements, pushing the boundaries of AI capabilities across an ever-expanding array of domains. However, as their complexity grows, so do the challenges: ensuring their reasoning is robust, their interactions are reliable, their deployment is efficient, and their outputs are safe and fair. Recent research is tackling these multifaceted issues head-on, delivering innovative solutions that promise to unlock the next generation of AI applications.
The Big Idea(s) & Core Innovations
The central theme emerging from recent work is the push towards more reliable, efficient, and interpretable LLMs and Multimodal Large Language Models (MLLMs). Researchers are moving beyond sheer scale, focusing on deeper reasoning, better contextual understanding, and practical deployability.
One significant leap in reasoning comes from process supervision and reward modeling. Papers like GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning by authors from LMU Munich and Technical University of Munich introduce frameworks that enhance multi-step reasoning by ensuring factual fidelity through structured paths and external verification. Complementing this, LaSeR: Reinforcement Learning with Last-Token Self-Rewarding from Renmin University of China and Tencent simplifies reinforcement learning with verifiable rewards (RLVR) by showing that the true reasoning reward can be efficiently computed from the last-token self-rewarding score. This is echoed in AutoRubric-R1V: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning by the University of Notre Dame and Uniphore, which uses rubric-based generative rewards to combat spurious reasoning in multimodal tasks, achieving state-of-the-art results.
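To make the process-reward idea concrete, here is a minimal sketch of how per-step scores from a PRM can be aggregated into a single trajectory reward. The `score_step` callable is a hypothetical stand-in for a learned process reward model, and min-aggregation is just one common convention from the PRM literature; none of this reproduces the exact methods of the papers above.

```python
from typing import Callable, List

def trajectory_reward(
    steps: List[str],
    score_step: Callable[[str, List[str]], float],
) -> float:
    """Aggregate per-step process-reward scores into one trajectory reward.

    `score_step(step, context)` stands in for a learned PRM that returns
    a correctness score in [0, 1] for a reasoning step given the steps
    that precede it.
    """
    scores: List[float] = []
    context: List[str] = []
    for step in steps:
        scores.append(score_step(step, list(context)))
        context.append(step)
    # Conservative min-aggregation: a trajectory is only as good as its
    # weakest step; mean and product are common alternatives.
    return min(scores) if scores else 0.0
```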
For multimodal capabilities, MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning from CUHK and Huawei introduces a framework for MLLMs to perform visual chain-of-thought reasoning on complex math problems by generating and strategically using high-fidelity diagrams. Similarly, VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning by Shanghai Jiao Tong University and Huawei enhances video temporal grounding using visual tools like progress bars. Further advancements in multimodal understanding are seen in Vision-Centric Activation and Coordination for Multimodal Large Language Models by Shanghai Jiao Tong University and Ant Group, which uses discriminative alignment to integrate vision-centric information for better visual understanding, and Spatial Preference Rewarding for MLLMs Spatial Understanding from Nanyang Technological University, which optimizes for precise spatial reasoning through direct preference optimization.
Efficiency and cost reduction are addressed in several papers. Agentic NL2SQL to Reduce Computational Costs by the University of Freiburg and Prior Labs significantly cuts down token usage in NL2SQL tasks by selectively querying database metadata. MX+: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving from Seoul National University introduces an extension to low-bit quantization that boosts LLM inference performance with minimal overhead. Furthermore, Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling by Zhejiang University and Imperial College London proposes a theoretical framework for combining LLM and Process Reward Model (PRM) signals, achieving superior test-time scaling efficiency with reduced computational cost.
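As a rough illustration of combining LLM and PRM signals at test time, the sketch below blends a length-normalized LLM log-likelihood with a PRM min-score in a weighted best-of-N selector. The linear blend and the weight `alpha` are illustrative assumptions; the paper's theoretically optimal aggregation rule is not reproduced here.

```python
import math
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    text: str
    token_logprobs: List[float]   # per-token log-probabilities from the LLM
    step_scores: List[float]      # per-step scores from a PRM

def combined_score(c: Candidate, alpha: float = 0.5) -> float:
    # Length-normalized LLM log-likelihood.
    llm_signal = sum(c.token_logprobs) / max(len(c.token_logprobs), 1)
    # PRM signal: log of the weakest step score (min-aggregation).
    prm_signal = math.log(max(min(c.step_scores, default=1e-9), 1e-9))
    # Illustrative linear blend; the paper derives how to weight these
    # signals optimally, which this sketch does not attempt.
    return alpha * llm_signal + (1 - alpha) * prm_signal

def best_of_n(candidates: List[Candidate], alpha: float = 0.5) -> Candidate:
    """Pick the sampled solution that maximizes the blended score."""
    return max(candidates, key=lambda c: combined_score(c, alpha))
```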
Agentic systems are also seeing significant innovation. The Gatekeeper Knows Enough by BoA AI CoE proposes a protocol for LLM agents to reason over low-fidelity system representations, improving reliability and efficiency. In the realm of code, CodeEvolve: An open source evolutionary coding agent for algorithm discovery and optimization from Inter Science and Worcester Polytechnic Institute combines LLMs with genetic algorithms for automated algorithm discovery and optimization. Addressing the practical deployment of agents, Natural Language Tools: A Natural Language Approach to Tool Calling In Large Language Agents by PokketCoach demonstrates significant gains in tool-calling accuracy and reduced variance by using natural language outputs instead of structured JSON.
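As a toy version of the natural-language tool-calling idea, the sketch below routes a plain-English tool request with a lightweight parser instead of demanding well-formed JSON from the model. The phrasing template, tool names, and regex are hypothetical, not the paper's protocol.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Hypothetical tools for illustration; any real agent toolset slots in here.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda q: f"search results for {q!r}",
    "weather": lambda city: f"forecast for {city}",
}

def parse_natural_call(utterance: str) -> Optional[Tuple[str, str]]:
    """Accept a sentence like 'Use the search tool with input: cats'
    rather than requiring the model to emit well-formed JSON."""
    m = re.search(r"use the (\w+) tool with input:\s*(.+)", utterance, re.IGNORECASE)
    return (m.group(1).lower(), m.group(2).strip()) if m else None

def dispatch(utterance: str) -> str:
    parsed = parse_natural_call(utterance)
    if parsed is None:
        return "no tool call detected"
    name, arg = parsed
    tool = TOOLS.get(name)
    return tool(arg) if tool else f"unknown tool: {name}"

print(dispatch("Use the search tool with input: latest LLM papers"))
```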
Under the Hood: Models, Datasets, & Benchmarks
To drive these innovations, researchers are developing a rich ecosystem of specialized models, datasets, and benchmarks:
- Reasoning & Multimodal Models:
- MathCanvas (https://mathcanvas.github.io/) and MathCanvas-Bench are introduced to facilitate visual chain-of-thought for mathematical reasoning in MLLMs.
- SQ-LLM is a speech-quality-aware model trained with chain-of-thought reasoning and reward optimization for interpretable speech quality evaluation in SpeechLLM-as-Judges.
- RoboGPT-R1 (https://github.com/alibaba/EasyR1) is a two-stage fine-tuning framework for enhanced robotic planning on long-horizon tasks, outperforming GPT-4o-mini.
- Vgent (https://xiaoqian-shen.github.io/Vgent) uses a graph-based RAG framework for long video understanding with Large Video Language Models (LVLMs).
- KRLM (https://anonymous.4open.science/r/KRLM-EA36) unifies knowledge graphs and LLMs for inductive reasoning, demonstrating strong zero-shot and transfer capabilities.
- Efficiency & Compression:
- Elastic-Cache (https://vila-lab.github.io/elastic-cache-webpage/) adaptively recomputes KV caches in diffusion LLMs, improving decoding speed without compromising quality.
- MX+ improves low-bit quantization for efficient LLM serving, enabling higher performance with negligible slowdown (see the shared-scale sketch after this list).
- A Free Lunch in LLM Compression: Revisiting Retraining after Pruning highlights Wanda and SparseGPT as pruning methods, showing the importance of local reconstruction.
- Safety & Alignment:
- Qwen3Guard (https://github.com/QwenLM/Qwen3Guard) is a multilingual safety guardrail model with tri-class classification and real-time streaming safety detection.
- GuardSpace preserves safety alignment during LLM fine-tuning using a novel safety-sensitive subspace and harmful-resistant null-space.
- MedTrust-Align (https://arxiv.org/pdf/2510.14400) integrates iterative retrieval-verification and hallucination-aware preference optimization to improve biomedical question answering.
- ATGen uses adversarial reinforcement learning for test case generation, enhancing LLM debugging and code quality.
- Evaluation & Benchmarking:
- MetaBench is the first comprehensive benchmark for LLMs in metabolomics, identifying performance variations and bottlenecks.
- Pluto is a new benchmark for evaluating the functional correctness and synthesis efficiency of LLM-generated hardware code.
- RAGCap-Bench (https://github.com/jingru-lin/RAGCap-Bench) offers a capability-oriented benchmark for agentic RAG systems, focusing on intermediate tasks like planning and evidence extraction.
- FinDeepResearch introduces HisRubric, a framework for evaluating deep research agents in rigorous financial analysis.
- GLOBALGROUP (https://github.com/cgsol/globalgroup) evaluates abstract reasoning across multiple languages, revealing linguistic biases and model architectural impacts.
- MathMist (https://github.com/mahbubhimel/MathMist) is a parallel multilingual benchmark dataset for mathematical problem-solving and reasoning.
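One mechanic behind the MX+ entry above is simple enough to sketch: microscaling (MX) formats store tensor elements as low-bit integers that share one power-of-two scale per small block. The NumPy sketch below shows that shared-scale idea; the block size, bit width, and rounding rule are illustrative assumptions, and MX+'s actual extension to these formats is not reproduced here.

```python
import numpy as np

def mx_quantize(x: np.ndarray, block: int = 32, bits: int = 8):
    """MX-style microscaling sketch: each block of `block` values shares
    one power-of-two scale, and elements are stored as low-bit integers.
    Defaults are illustrative, not the exact MX+ configuration."""
    x = x.reshape(-1, block)            # assumes len(x) % block == 0
    qmax = 2 ** (bits - 1) - 1
    amax = np.abs(x).max(axis=1, keepdims=True)
    # Shared per-block scale, rounded up to a power of two as in MX formats.
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-12) / qmax))
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def mx_dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(128).astype(np.float32)
q, s = mx_quantize(w)
print("max abs error:", np.abs(w - mx_dequantize(q, s)).max())
```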
Impact & The Road Ahead
These advancements herald a future where LLMs are not just powerful, but also more trustworthy, efficient, and adaptable. The focus on refining reasoning through explicit process supervision, integrating multimodal inputs for richer understanding, and optimizing for real-world deployment is critical. Innovations in fields like drug discovery with MECo (https://arxiv.org/pdf/2510.14455) from Tsinghua University, which translates natural language intentions into executable code for molecular optimization, exemplify the practical potential.
The development of robust evaluation frameworks like ColorBench (https://arxiv.org/pdf/2510.14621) for mobile agents and of benchmarks like FinDeepResearch for financial analysis is crucial for systematically assessing and improving AI capabilities in complex, safety-critical domains. Addressing issues such as “Brain Rot” (https://arxiv.org/pdf/2510.13928) from Texas A&M and potential biases in drug-safety decisions (https://arxiv.org/pdf/2510.13931) is paramount for building truly reliable AI systems.
Looking forward, the integration of formal methods for agentic AI safety and security (https://arxiv.org/pdf/2510.14133) will become increasingly vital as LLM-powered agents become more autonomous. The ability to generate fair consensus statements using social choice theory (https://arxiv.org/pdf/2510.14106) and dynamically adapt to user needs with Just-In-Time Objectives (https://arxiv.org/pdf/2510.14591) points towards a future of highly personalized and ethical AI interactions. The rapid evolution of LLMs promises not just more intelligent systems, but smarter, safer, and more universally accessible AI.