
Large Language Models: Navigating the Complexities of Multimodality, Reasoning, and Safety

Latest 180 papers on large language models: May 23, 2026

Large Language Models (LLMs) keep gaining capabilities, and as their influence grows, so does the difficulty of keeping them reliable, safe, and efficient across diverse applications. Recent research ranges from strengthening core reasoning and understanding to tackling the challenges of multimodal input, ethical alignment, and practical deployment. This digest explores a collection of papers on these advances.

The Big Ideas & Core Innovations

One central theme is the quest to make LLMs more robust and capable reasoners, especially when confronted with novel or complex information. In Advancing Mathematics Research with AI-Driven Formal Proof Search, the Google DeepMind team introduces AlphaProof Nexus. This framework combines LLM-based proof generation with the Lean proof assistant and autonomously solves open mathematical problems, including 9 Erdős problems. The result shows LLMs’ potential in abstract reasoning when formal verification acts as a filter against hallucinations: a candidate proof counts only if Lean accepts it.
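
The generate-then-verify pattern described above can be sketched in a few lines. This is a minimal illustration, not the AlphaProof Nexus pipeline: the proposals list stands in for model output, and the toy checker (a factor test on 91) stands in for Lean. All names here are hypothetical.

```python
from typing import Callable, Iterable, Optional


def first_verified(candidates: Iterable[str],
                   verify: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate the verifier accepts, or None if none pass."""
    for candidate in candidates:
        if verify(candidate):
            return candidate
    return None


def is_nontrivial_factor_of_91(text: str) -> bool:
    """Toy stand-in for a proof checker: accept only a real factor of 91."""
    try:
        x = int(text)
    except ValueError:
        return False  # malformed output is rejected, never trusted
    return 1 < x < 91 and 91 % x == 0


# Hypothetical model proposals; some are wrong or malformed.
proposals = ["3", "maybe 5", "7", "13"]
result = first_verified(proposals, is_nontrivial_factor_of_91)
print(result)  # 7
```

The point of the design is that the generator can be unreliable: incorrect or malformed candidates cost compute but never reach the output, because only the checker decides what is accepted.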

Yet LLMs still struggle with basic forms of perception and reasoning. In Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs, researchers from Kyung Hee University uncover “directional motion blindness” in Video-LLMs: the models fail to identify basic motion directions. Their solution, DeltaDirect, raises performance on this task from near-random to over 85%, showing that fine-grained perception-language gaps can be targeted and closed.

Another critical area is the temporal dimension of knowledge and reasoning. Kyutai’s work, Understanding Data Temporality Impact on Large Language Models Pre-training, reveals that sequential pre-training on chronologically ordered data creates a “recency peak,” making LLMs more up-to-date and temporally precise than standard shuffled training, without compromising general language understanding. This indicates a path toward building LLMs that naturally track evolving information.
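
The difference between the two data orderings is easy to show on a toy corpus. The sketch below assumes each document carries a date field; the corpus, field names, and helper functions are hypothetical and are not taken from the Kyutai codebase.

```python
import random
from datetime import date

# Hypothetical toy corpus: each document carries a crawl date.
corpus = [
    {"text": "report from 2023", "date": date(2023, 3, 1)},
    {"text": "report from 2018", "date": date(2018, 6, 15)},
    {"text": "report from 2025", "date": date(2025, 1, 10)},
    {"text": "report from 2020", "date": date(2020, 11, 30)},
]


def shuffled_order(docs, seed=0):
    """Standard setup: documents are seen in random order."""
    rng = random.Random(seed)
    out = list(docs)
    rng.shuffle(out)
    return out


def chronological_order(docs):
    """Sequential setup: oldest documents first, newest last."""
    return sorted(docs, key=lambda d: d["date"])


chron = chronological_order(corpus)
shuf = shuffled_order(corpus)
print([d["date"].year for d in chron])  # [2018, 2020, 2023, 2025]
```

With chronological ordering, the most recent data is always what the model saw last, which is consistent with the “recency peak” the paper reports.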

Beyond basic reasoning, several papers address the practical limitations and ethical concerns of LLMs. Reducing Political Manipulation with Consistency Training, from the Center for AI Safety and UC Berkeley, identifies “covert political bias” and proposes Political Consistency Training (PCT). This RL-based method uses dual consistency paradigms and significantly reduces subtle manipulation while preserving helpfulness, a step toward fair and unbiased information delivery.

On the practical side, the University of Toronto’s FAME: Failure-Aware Mixture-of-Experts for Message-Level Log Anomaly Detection presents an efficient mixture-of-experts framework. By using an LLM only once during offline setup to partition failure domains, FAME achieves high accuracy with drastically reduced annotation effort and on-premise inference, making sophisticated anomaly detection more accessible.
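
As a rough picture of the router-plus-experts layout, the sketch below routes each log message to one failure-domain expert and lets that expert decide whether the message is anomalous. It assumes the failure domains were fixed once, offline. Keyword matching stands in for the learned router and expert models, and every domain name and keyword list is hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Expert = Callable[[str], bool]


def make_keyword_expert(anomaly_phrases: List[str]) -> Expert:
    """Toy expert: flags a message if it contains a known anomaly phrase."""
    def expert(message: str) -> bool:
        low = message.lower()
        return any(phrase in low for phrase in anomaly_phrases)
    return expert


# Hypothetical failure domains, assumed to come from a one-time offline step.
DOMAIN_KEYWORDS: Dict[str, List[str]] = {
    "memory": ["memory", "ecc", "dimm"],
    "network": ["link", "socket", "packet"],
}
EXPERTS: Dict[str, Expert] = {
    "memory": make_keyword_expert(["ecc error", "out of memory"]),
    "network": make_keyword_expert(["link down", "timeout"]),
}


def route(message: str, default: str = "memory") -> str:
    """Toy router: pick the domain with the most keyword hits."""
    low = message.lower()
    scores = {d: sum(k in low for k in kws) for d, kws in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def detect(message: str) -> Tuple[str, bool]:
    """Route the message, then let only the chosen expert judge it."""
    domain = route(message)
    return domain, EXPERTS[domain](message)


print(detect("Link down on eth0, packet loss"))   # ('network', True)
print(detect("DIMM 3 memory check passed"))       # ('memory', False)
```

Because no LLM call appears in `detect`, inference in this layout can run entirely on local hardware, which matches the on-premise property highlighted above.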

Under the Hood: Models, Datasets, & Benchmarks

The advancements in these papers are often underpinned by novel architectural designs, specialized datasets, and rigorous benchmarks:

  • AlphaProof Nexus: Leverages the Lean proof assistant to autonomously solve problems such as Erdős problems and OEIS conjectures, with formal verification checking each result. Their code is available at https://www.github.com/google-deepmind/alphaproof-nexus-results.
  • DeltaDirect: Introduces the MODIRECT dataset family (MODIRECT-INST, MODIRECT-SYNBENCH, MODIRECT-REALBENCH) to diagnose and overcome “directional motion blindness” in Video-LLMs. The project’s code is open-source at https://github.com/KHU-VLL/DeltaDirect.
  • Sequential Pre-training: Utilizes KairosQA, a benchmark of 7,167 temporally grounded questions from Wikidata, and Common Crawl data (2018-2025) to evaluate temporal knowledge. Code and checkpoints are available at https://github.com/kyutai-labs/kairos and https://github.com/kyutai-labs/dactory.
  • Political Consistency Training (PCT): Employs a Polarized Contrastive Pairs dataset with 50 topic pairs and 5 prompt templates. Further resources can be found at https://political-manipulation.ai.
  • FAME: Evaluated on challenging datasets like BGL (4.7M log lines) and Thunderbird (5M log lines). It uses DistilBERT for router components and BERT (bert-base-uncased) for expert models. The authors provide code.
  • GS-QA: A new benchmark for geospatial question answering containing 2,800 question-answer pairs built on OpenStreetMap and Wikipedia data, exposing LLMs’ struggles with complex spatial predicates. More details are available at https://arxiv.org/pdf/2605.22811.
  • Self-Policy Distillation (SPD): Uses standard benchmarks like MBPP (code generation), GSM8K (mathematical reasoning), and MMLU (multiple-choice QA) for evaluation, and enhances KV activation processing.
  • SegCompass: Leverages Sparse Autoencoders (SAEs) and GRPO reinforcement learning to align chain-of-thought reasoning with segmentation masks. Datasets include RefCOCO, RefCOCO+, ReasonSeg, and OBELICS. Code is at https://github.com/ZhenyuLU-Heliodore/SegCompass.
  • AtelierEval: The first unified benchmark for Text-to-Image prompting proficiency, evaluating 8 MLLMs against 48 humans across 360 tasks. It introduces AtelierJudge, a cognitive-mimetic agentic evaluator with 0.81 Spearman correlation with human experts. More information is available at https://arxiv.org/pdf/2605.22645.
  • Boiling the Frog: A novel multi-turn benchmark for agentic safety, testing susceptibility to incremental attacks in corporate environments. It evaluates 9 models across 157 scenarios and introduces the Safe Agency Score (SAS). Read the paper at https://arxiv.org/pdf/2605.22643.
  • A Multi-Source Framework for Relational Validation of Large Language Models: Compares LLM-generated knowledge graphs against ten expert-curated encyclopedic sources, revealing a “relational deficit” in LLMs. The paper is available at https://arxiv.org/pdf/2605.22636.
  • Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion: Investigates Hyperfitting using models like TinyLlama-1.1B and Qwen2.5-1.5B on datasets like Fiction-Stories and Writing-Prompts, introducing Late-Stage LoRA for efficient fine-tuning. The paper can be found at https://arxiv.org/pdf/2605.22579.
  • VGenST-Bench: A video benchmark actively synthesizing controlled scenarios with video generative models for spatio-temporal reasoning, introducing a 3x2x2 taxonomy and three-level question hierarchy. The benchmark is at https://zinosii.github.io/VGenST-Bench/ and https://arxiv.org/pdf/2605.22570.
  • LANG: Reinforcement Learning for Multilingual Reasoning: Employs MMATH (10 languages) and PolyMath (18 languages) benchmarks with Qwen2.5-3B/7B/32B-Instruct and Llama3.1-8B-Instruct models. Code is at https://github.com/fmm170/LANG.
  • GeoWeaver: Grounding Visual Tokens with Geometric Evidence: Uses a multi-level geometry bank from a frozen VGGT encoder to enrich visual tokens before language decoding, evaluated on VSI-Bench, SPAR-Bench, and other benchmarks. Code is at https://github.com/yahooo-m/GeoWeaver.
  • FashionLens: Toward Versatile Fashion Image Retrieval: Introduces U-FIRE benchmark (15 fashion datasets) and proposes Proposal-Guided Spherical Query Calibrator (PGSQC) and Gradient-Guided Adaptive Sampling (GGAS) for versatile fashion image retrieval. Code: https://github.com/haokunwen/FashionLens.
  • SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation: Creates SpaceDG dataset (~1M QA pairs from ScanNet++) and SpaceDG-Bench (1,102 human-verified questions) using 3D Gaussian Splatting to simulate visual degradations. Code: https://github.com/Visionary-Laboratory/SpaceDG.
  • BeLink: Biomedical Entity Linking Meets Generative Re-Ranking: Uses instruction-tuned generative models for re-ranking, evaluated on 8 biomedical benchmarks (GNormPlus, NCBI-Disease, BC5CDR). Code: https://github.com/dash-ka/BeLink.
  • The Neural Compiler: Program-to-Network Translation: Compiles symbolic programs into differentiable PyTorch modules, verified on heat equation and damped pendulum problems. Code: https://github.com/sheneman/neural_compiler.
  • Polite on the Surface, Wrong in Practice: Introduces BLADE dataset (4,196 instruction-tuning pairs) to fix honorific failures in multilingual Bangla generation. Code: https://github.com/ashuvo25/Bangla_Application_LLM/tree/main.
  • Assisted Counterspeech Writing: Creates a dataset of 324 hateful/misinformed claims with expert-verified counterspeech and supporting knowledge from fact-checking articles. Code: https://github.com/LanD-FBK/counterspeech_against_hate_and_misinfo.
  • From Recognition to Reasoning: Benchmarking and Enhancing MLLMs on Real-World Receipt Document Understanding: Introduces ReceiptBench (10k real-world receipt images) and a two-stage training framework with Metric-Aware Group Relative Policy Optimization (GRPO). Code: https://github.com/wwwT0ri/ReceiptBench.
  • Translating Signals to Languages for sEMG-Based Activity Recognition: Proposes LLM-sEMG, translating sEMG signals into a “sEMG language” using a VQ-VAE model and iterated learning, evaluated on GRABMyo and NinaPro DB2. Code: https://github.com/Lightning-AI/lit-llama.
  • Unified Data Selection for LLM Reasoning: Introduces High-Entropy Sum (HES), a training-free metric for data selection, validated across SFT, RFT, and RL training paradigms on datasets like Mixture-of-Thoughts and DeepScaler. No public code repository given in abstract.
  • Boundary-targeted Membership Inference Attacks on Safety Classifiers: Investigates MIAs against safety classifiers using datasets like BeaverTails and XGuard-Train, and proposes Laplace output perturbation as a defense. Code: https://github.com/anthonyhughes/safety-classifiers.
  • VeriScale: Adversarial Test-Suite Scaling for Verifiable Code Generation: Uses an adversarial framework to expand (VERINAPLUS) and reduce (VERINALITE) test suites for verifiable code generation. Code: https://github.com/XiaoyangLiu-sjtu/VeriScale.
  • AgroTools: A Benchmark for Tool-Augmented Multimodal Agents in Agriculture: Introduces AgroTools benchmark (539 QA, 1,097 images, 14 tools) for tool-augmented MLLMs in agriculture. Dataset and code: https://huggingface.co/datasets/AgroTools/AgroTools.
  • Modeling Pathology-Like Behavioral Patterns in Language Models: Introduces a behavioral induction framework that trains models on synthetic datasets grounded in DSM-5 criteria so that they exhibit depression-like and paranoia-like patterns. Uses LoRA and the Unsloth library for training.
  • Bernini: Latent Semantic Planning for Video Diffusion: A unified framework combining MLLMs with diffusion models for video generation/editing, using Segment-Aware 3D RoPE and chain-of-thought reasoning. Project website: https://bernini-ai.github.io.
  • Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression: Uses input-adaptive soft tokens and attention-flow based context consolidation for KV cache compression, evaluated on LongBench and RULER. No public code repository given in abstract.
  • A First Measurement Study on Authentication Security in Real-World Remote MCP Servers: Empirically studies authentication security in remote MCP servers, identifying pervasive OAuth flaws. The paper is available at https://arxiv.org/pdf/2605.22333.
  • One LR Doesn’t Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs: Introduces Layerwise Learning Rate (LLR) for Transformer layers based on Heavy-Tailed Self-Regularization (HT-SR) theory, achieving faster training and improved zero-shot accuracy. Code: https://github.com/heducas/Layer-wise-Learning-Rate.
  • SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules: A modular framework enhancing LLMs with topological perception, molecular generation, and reaction sensing modules for chemistry tasks. Code: https://github.com/OpenBMB/SciCore-Mol.
  • MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering: Compresses KV caches at patch, frame, and segment levels using dual-signal compression for streaming video QA, evaluated on RVSEgo and RVSMovie. No public code repository given in abstract.
  • Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting: Introduces CITA (Chinese Implicit Toxicity Attack), a three-stage red-team framework for evaluating Chinese toxicity detectors against implicit harmful content. Code: https://github.com/Timing04/CITA.
  • What are the Right Symmetries for Formal Theorem Proving?: Introduces rewriting categories and symmetry notions for LLM-based theorem provers, using miniF2F-rw benchmark. Code: https://github.com/kolejnyy/rw-ensembles.
  • Evaluating Large Language Models as Live Strategic Agents: Evaluates LLMs as live strategic agents in a timed multi-phase Risk game, showing provider differences compress when planning and execution are separated. Code: https://github.com/hcekne/risk-game.
  • SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval: Introduces SGR-BENCH for evaluating search agents on state-gated retrieval tasks, revealing bottlenecks in within-site state control. Dataset: https://huggingface.co/datasets/PKUAIWeb/SGR-BENCH.
  • CLORE: Content-Level Optimization for Reasoning Efficiency: A content-level optimization framework for RL-trained LLMs, editing correct rollouts to remove low-quality reasoning content, evaluated on OlympiadBench and Minerva. No public code repository given in abstract.
  • Skill Weaving: Efficient LLM Improvement via Modular Skillpacks: Decomposes LLMs into domain-specific skillpacks for efficient specialization using SkillZip compression, evaluated on AgentBench, GSM8K, and HumanEval. Code: https://anonymous.4open.science/r/anonymous-repo-BFE7.
  • Zero-Shot Temporal Action Localization Through Textual Guidance: Introduces TEGU, leveraging textual information from LLMs and scene triplets to improve zero-shot temporal action localization without training data, on THUMOS14 and ActivityNet-v1.3. Code: https://github.com/benedettaliberatori/tegu.
  • Reinforced Graph of Thoughts: RL-Driven Adaptive Prompting for LLMs: Introduces RGoT, an RL-based approach to adaptively generate and traverse graphs of operations for the Graph of Thoughts prompting paradigm. Code: https://github.com/mriesen/reinforced-graph-of-thoughts.
  • Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis: Proposes a multimodal fusion pipeline combining downsampled video with IMU telemetry and semantic metadata to generate pseudo-labels for SCE detection. The paper is available at https://arxiv.org/pdf/2605.22185.
  • MAESTRO: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles: An RL-driven orchestration framework that dynamically composes ensembles of frozen expert models and a two-tier skill library. Code: https://github.com/jinyangwu/Maestro.
  • LLM-Metrics: Measuring Research Impact Through Large Language Model Memory: Proposes LLM-Metrics, a research-impact assessment metric derived from parametric memory of LLMs, correlated with citation counts. The paper is available at https://arxiv.org/pdf/2605.22176.
  • SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?: Introduces SWE-Mutation, a benchmark for evaluating LLMs’ ability to generate reliable test suites using agentic mutation. Code: https://github.com/Sunny4Coding/SWE-Mutation.
  • Spectra as Language: Large Language Models for Scalable Stellar Parameter and Abundance Inference: Treats stellar spectra as language sequences, applying two-stage fine-tuning of LLaMA-3.1-8B for stellar parameter determination from LAMOST and APOGEE data. The paper is available at https://arxiv.org/pdf/2605.22162.
  • ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding: A training-free framework building a spatio-temporal graph for efficient video understanding with MLLMs, using dual selection strategy. Code: https://github.com/bingjunluo/ST-SimDiff.
  • One-Way Policy Optimization for Self-Evolving LLMs: Proposes OWPO, an RL method with asymmetric reweighting (Accelerated Alignment, Gain Locking) for continuous self-evolution of LLMs, tested on mathematical reasoning benchmarks. The paper is available at https://arxiv.org/pdf/2605.22156.
  • Psy-Chronicle: A Structured Pipeline for Synthesizing Long-Horizon Campus Psychological Counseling Dialogues: Introduces Psy-Chronicle and CPCD dataset (100 student profiles, 90,000 dialogues) for synthesizing long-horizon counseling dialogues. Code: https://github.com/EdwinUSTB/Psy-Chronicle.
  • Cross-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self-Consistency: A self-supervised framework using multilingual self-consistency to identify and transfer cultural knowledge in local-language representations. The paper is available at https://arxiv.org/pdf/2605.22137.
  • Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?: Introduces Grounded Personality Reasoning (GPR) and MM-OCEAN benchmark (1,104 videos, 5,320 MCQs) to evaluate MLLMs’ ability to anchor personality ratings in observable evidence. The paper is available at https://arxiv.org/pdf/2605.22109.
  • Not Yet: Humans Outperform LLMs in a Colonel Blotto Tournament: Compares human and LLM strategic behavior in a Colonel Blotto game, finding humans outperform LLMs. Dataset: https://doi.org/10.7910/DVN/YUM1BI.
  • Automated Repair of TEE Partitioning Issues via DSL-Guided and LLM-Assisted Patching: Presents TEERepair, combining a DSL with LLMs for automated repair of Trusted Execution Environment (TEE) partitioning issues. The paper is available at https://arxiv.org/pdf/2605.22087.
  • Ishigaki-IDS-Bench: A Benchmark for Generating Information Delivery Specification: Introduces Ishigaki-IDS-Bench (166 examples) for evaluating LLMs’ ability to generate IDS XML from BIM information requirements. Dataset: https://huggingface.co/datasets/ONESTRUCTION/Ishigaki-IDS-Bench.
  • Enhancing Visual Token Representations for Video Large Language Models: Introduces ST-GridPool, a training-free visual token enhancement method combining Pyramid Temporal Gridding (PTG) and Norm-based Spatial Pooling (NSP). Code: https://github.com/bingjunluo/ST-GridPool.
  • Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention: A two-stage training framework for MLLMs addressing perception-reasoning disconnect by anchoring visual attention to image regions and reinforcing faithful evidence use. The paper is available at https://arxiv.org/pdf/2605.22072.
  • LABO: LLM-Accelerated Bayesian Optimization through Broad Exploration and Selective Experimentation: A dual-fidelity Bayesian optimization framework integrating inexpensive LLM predictions with costly real experiments. The paper is available at https://arxiv.org/pdf/2605.22054.
  • Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support: Introduces ROUNDS-Bench, an OSCE-inspired interactive benchmark with a Standardized Patient Simulator for evaluating LLMs’ active evidence gathering in clinical diagnosis. Code: https://github.com/Leonard-zc/ROUNDS-Bench.
  • LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning: A novel audio-visual reasoning framework conducting joint reasoning in a unified latent space by interleaving textual deduction with continuous audio-visual latent states. The paper is available at https://arxiv.org/pdf/2605.22012.
  • Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning: Proposes CRPO (Counterfactual Relational Policy Optimization), a dual-branch RL framework, and DyBench, a paired counterfactual benchmark, to improve spatiotemporal sensitivity in Video LLMs. The paper is available at https://arxiv.org/abs/2605.21988.
  • LLM Retrieval for Stable and Predictable Ad Recommendations: An LLM-powered semantic candidate generation framework for ad recommendations that extracts hierarchical semantic attributes and uses graph-based expansion for consistency. The paper is available at https://arxiv.org/pdf/2605.21969.
  • Reinforced Preference Optimization for Reasoning-Augmented Recommendations: Introduces RPORec, unifying LLM Chain-of-Thought reasoning with a specialized recommendation head (Rechead) for precise item retrieval. The paper is available at https://arxiv.org/pdf/2605.21967.
  • SpecHop: Continuous Speculation for Accelerating Multi-Hop Retrieval Agents: A continuous speculation framework that maintains multiple parallel speculative threads to accelerate multi-hop retrieval-augmented language model agents. Code: https://github.com/mehrdadsaberi/spechop.
  • ChronoMedicalWorld: A Medical World Model for Learning Patient Trajectories: Introduces ChronoMedicalWorld Model (CMWM), an action-conditioned latent world-model framework for patient trajectory forecasting from longitudinal care data. The paper is available at https://arxiv.org/pdf/2605.21963.
  • AI-Enabled Serious Games: Integrating Intelligence and Adaptivity in Training Systems: A theoretical review of AI approaches in serious games, distinguishing instructional intelligence and adaptivity and proposing agent-based architectures. The paper is available at https://arxiv.org/pdf/2605.21962.
  • MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues: Reveals Temporal Grounding Heads (TG-Heads) in MLLMs and proposes an inference-time read-then-regenerate framework to recover temporal grounding signals. The paper is available at https://arxiv.org/pdf/2605.21954.
  • NasZip: Software and Hardware Co-Design to Accelerate Approximate Nearest Neighbor Search: Proposes NasZip, a hardware-software co-designed framework accelerating ANNS using DIMM-based near-data processing (NDP), with FEE-sPCA and Dfloat compression. The paper is available at https://arxiv.org/pdf/2605.21952.
  • EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models: A temporal-centric self-evolving framework for Video-LLMs to improve from raw, unannotated videos using temporal-aware Questioner and temporal-grounded Solver rewards. The paper is available at https://arxiv.org/pdf/2605.21931.
  • Planning in the LLM Era: Building for Reliability and Efficiency: A position paper arguing for LLMs to generate planners at construction time rather than inference time, examining NL2Search, NL2PDDL, and NL2Policy methods. The paper is available at https://arxiv.org/pdf/2605.21902.
  • Token-weighted Direct Preference Optimization with Attention: Introduces Token-weighted DPO (TwDPO), a novel objective that assigns importance weights to tokens using the LLM’s own attention patterns as a judge. Code: https://github.com/HCY123902/AttentionPO.
  • Hypergraph as Language: Introduces the “Hypergraph as Language” perspective and proposes Hyper-Align, the first hypergraph-native alignment framework for LLMs. Code: https://github.com/Mengqi-Lei/Hypergraph-as-Language.
  • The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation: Discovers that CoT reasoning masks memorization and proposes Zero-CoT Probe (ZCP), a black-box method to expose contamination. Code: https://github.com/Yifan-Lan/zero-cot-probe.
  • Seizure-Semiology-Suite (S3): A Clinically Multimodal Dataset, Benchmark, and Models for Seizure Semiology Understanding: Introduces S3, a comprehensive clinically-grounded dataset and benchmark for MLLMs on fine-grained seizure semiology from video. Code: https://github.com/SeizureSemiologySuite.
  • Comparing LLM and Fine-Tuned Model Performance on NVDRS Circumstance Extraction: Develops a Complexity Score algorithm to predict optimal prompt strategy for extracting suicide-related circumstances from reports, finding LLMs outperform fine-tuned models on low-prevalence cases. The paper is available at https://arxiv.org/pdf/2605.21845.
  • A Large Language Model Approach to Generating Bypass Rules for Malware Evasion: Presents ABLE, an automated framework leveraging LLMs to generate YARA bypass rules for malware sandbox evasion. The paper is available at https://arxiv.org/pdf/2605.21821.
  • Bridging the Cold-Start Gap: LLM-Powered Synthetic Data Generation for Natural Language Search at Airbnb: An LLM-powered framework for generating synthetic queries and labels to solve the cold-start problem in natural language search. The paper is available at https://arxiv.org/pdf/2605.21812.
  • When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering: Introduces OGCAREBENCH, a retrieval-focused benchmark for LLMs on rare clinical cases outside standard guidelines. The paper is available at https://arxiv.org/pdf/2605.21807.
  • Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization: Reveals limitations of entropy-based uncertainty and proposes Geometric-aware Calibrated Policy Optimization (GCPO) integrating Cosine Dispersion and Barycentric Transport. The paper is available at https://arxiv.org/pdf/2605.21801.
  • Reflective Prompt Tuning through Language Model Function-Calling: Proposes Reflective Prompt Tuning (RPT), using LLM function-calling to automate prompt optimization by mimicking human engineers. Code: https://github.com/megagonlabs/RPT.
  • FuzzingBrain V2: A Multi-Agent LLM System for Automated Vulnerability Discovery and Reproduction: A multi-agent system combining LLM-driven semantic analysis with coverage-guided fuzzing for automated vulnerability discovery. The paper is available at https://arxiv.org/pdf/2605.21779.
  • PromptNCE: Pointwise Mutual Information Predictions Using Only LLMs: Introduces PromptNCE, a zero-shot method for estimating PMI using LLMs and contrastive prompts with an explicit “OTHER” category. The paper is available at https://arxiv.org/pdf/2605.21776.
  • HIDBench: Benchmarking Large Language Models for Host-Based Intrusion Detection: A unified benchmark for evaluating LLMs in host-based intrusion detection, integrating three public provenance graph datasets. The paper is available at https://arxiv.org/pdf/2605.21773.
  • Manifold-Guided Attention Steering: Proposes Manifold-Guided Attention Steering (MAGS), an inference-time intervention that monitors attention heads for reasoning errors and applies targeted corrections. The paper is available at https://arxiv.org/pdf/2605.21770.
  • BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model: Introduces BEiTScore, a lightweight cross-encoder metric for reference-free image captioning evaluation, and LongCapVLCP benchmark. The paper is available at https://arxiv.org/pdf/2605.21728.
  • Probabilistic Attribution For Large Language Models: Introduces a model-agnostic probabilistic attribution score (AS) for LLMs by leveraging conditional probabilities and situating them within stochastic process theory. Code: https://github.com/sshilpika/probabilistic-attribution-score/.
  • PocketAgents: A Manifest-Driven Library of Autonomous Defense Agents: Introduces PocketAgents, a manifest-driven library of autonomous defense agents connecting LLMs to defensive enforcement through a typed boundary. The paper is available at https://arxiv.org/pdf/2605.21694.
  • Adversarial Reframing: A Framework for Targeted Generation in Language Models: Introduces THREAT, a framework coordinating multiple LLMs in an iterative search loop to discover jailbreak prompts. The paper is available at https://arxiv.org/pdf/2605.21674.
  • CR4T: Rewrite-Based Guardrails for Adolescent LLM Safety: Proposes CR4T, a model-agnostic framework that selectively reconstructs unsafe/refusal outputs into developmentally appropriate, guidance-oriented responses for adolescent LLM safety. The paper is available at https://arxiv.org/pdf/2605.21609.
  • Argo: Efficient Importance Labeling for Enterprise Email Systems: Presents Argo, a system for cost-efficient email importance labeling at enterprise scale, combining an offline profiler and resource provisioning. The paper is available at https://arxiv.org/pdf/2605.21604.
  • Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs: Introduces MOOD, a benchmark for evaluating LLM monitoring pipelines on diverse OOD alignment failures. The paper is available at https://arxiv.org/pdf/2605.21602.
  • When Support Escalates Distress: Regulation and Escalation in LLM Responses to Venting and Advice-Seeking: Examines how LLMs respond to venting vs. advice-seeking, finding responses to venting increase both regulatory and escalatory behaviors. The paper is available at https://arxiv.org/pdf/2605.21569.
  • From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment: Introduces P2D, a unified framework for efficient LLM alignment leveraging task-sensitive attention heads as a dual compass for sample mining and structural pruning. The paper is available at https://arxiv.org/pdf/2605.21558.
  • RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts: A matched-triple benchmark for evaluating LLM refusal behavior on biological research prompts across three risk tiers. The paper is available at https://arxiv.org/pdf/2605.21545.
  • Frequency-Domain Regularized Adversarial Alignment for Transferable Attacks: Proposes FRA-Attack, a transfer-based targeted adversarial attack against MLLMs using frequency-domain regularization to improve cross-model transferability. The paper is available at https://arxiv.org/pdf/2605.21541.
  • Detecting Synthetic Political Narratives in Cross-Platform Social Media Discourse: A cross-platform framework for detecting synthetic political narratives using four coordination signals combined into a Synthetic Narrative Coordination Score (SNC). The paper is available at https://arxiv.org/pdf/2605.21540.
  • DualOptim+: Bridging Shared and Decoupled Optimizer States for Better Machine Unlearning: Introduces DualOptim+, an optimization framework for machine unlearning in LLMs using shared base states and decoupled delta states. The paper is available at https://arxiv.org/pdf/2605.21539.
  • Contract Based Verification of Non-functional Requirements for Embedded Automotive C Code: Presents a framework for verifying non-functional requirements in embedded automotive C code using contract-based verification with a DSL. The paper is available at https://arxiv.org/pdf/2605.21532.
  • ACE: Self-Evolving LLM Coding Framework via Adversarial Unit Test Generation: Proposes ACE, a self-evolving code generation framework where a single LLM alternates between solver and adversary roles for adversarial unit test generation. The paper is available at https://arxiv.org/pdf/2605.16299.
  • Provably Protecting Fine-Tuned LLMs from Training Data Extraction: Proposes SCP-∆r, a Near Access Freeness (NAF)-based defense for protecting fine-tuned LLMs from training data extraction attacks while preserving utility. The paper is available at https://arxiv.org/pdf/2602.00688.
  • MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation: A comprehensive benchmark for evaluating LLMs’ multi-turn reasoning capabilities, comprising 4 classes, 40 tasks, and 3,600 instances. Code: https://github.com/LittleCirc1e/mtr_bench.
  • Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark: Introduces GPBench, a novel evaluation framework and benchmark for assessing LLMs as general practitioners. Code: https://github.com/AIPrimaryCare/gpbench.
  • Would You Want an AI Tutor? Understanding Stakeholder Perceptions of LLM-based Systems: Introduces Co-PALE, a stakeholder-first framework for reasoning about perceptions of LLM-based educational tools. The paper is available at https://arxiv.org/pdf/2503.02885.
  • ImProver: Agent-Based Automated Proof Optimization: A novel LLM-based agent that rewrites formal Lean proofs to optimize user-defined metrics while maintaining correctness, using Chain-of-States prompting. Code: https://github.com/riyazahuja/ImProver.
  • Enhancing Causal Reasoning in Large Language Models: A Causal Attribution Model for Precision Fine-Tuning: Introduces a causal attribution model using do-operators to quantify how different components contribute to LLMs’ causal reasoning. The paper is available at https://arxiv.org/pdf/2401.00139.
  • Exploring Code Analysis: Zero-Shot Insights on Syntax and Semantics with LLMs: Evaluates 21 LLMs on code analysis tasks (syntax parsing, static semantics, dynamic reasoning), revealing a consistent capability hierarchy. Code: https://github.com/mathieu0905/llm_code_analysis.git.
  • Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate: Investigates why Maximal Update Parameterization (µP) provides superior learning rate transfer, isolating the embedding layer learning rate as the primary factor. The paper is available at https://arxiv.org/pdf/2605.21486.
  • WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark: A human-curated knowledge-grounded VQA benchmark combining Wikipedia images, captions, and Wikidata knowledge. Dataset: https://huggingface.co/datasets/ibm-research/WikiVQABench.
  • You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories: Reveals that RLVR weight update trajectories are extremely low-rank (rank-1) and evolve near-linearly, proposing RELEX for training-free checkpoint prediction. Code: https://github.com/weizhepei/RELEX.
  • DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards: Introduces DelTA, which reweights token-gradient terms by their positive-negative discriminative signal in a self-normalized RLVR surrogate. Code: https://github.com/RUCBM/DelTA.
  • Leveraging LLMs for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution: Proposes using LLMs to automatically adapt Xtext grammars after metamodel evolution, evaluating on six real-world DSLs. The paper is available at https://arxiv.org/pdf/2605.21465.
  • Quantifying the cross-linguistic effects of syncretism on agreement attraction: Investigates how morphological syncretism affects agreement attraction errors across four languages using LLMs as processing proxies. The paper is available at https://arxiv.org/pdf/2605.21403.
  • Open-source LLMs administer maximum electric shocks in a Milgram-like obedience experiment: A systematic replication of Milgram’s obedience experiment on 11 open-source LLMs, finding most reach maximum shock levels despite distress. Code: https://github.com/biological-alignment-benchmarks/milgram-for-llms.
  • Combating Harms of Generative AI in CS1 with Code Review Interviews: Presents CS1-CR, using mandatory 15-minute oral code review interviews and a flipped classroom to address generative AI usage in coding assignments. The paper is available at https://arxiv.org/pdf/2605.21374.
  • “I didn’t Make the Micro Decisions”: Measuring, Inducing, and Exposing Goal-Level AI Contributions: Introduces COTRACE, a goal-level attribution framework tracing human-AI contributions in collaboration, distinguishing direct contributions and indirect influences. Code: https://github.com/rladmstn1714/CoTrace.
  • LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking: A black-box meta-attack framework using adaptive semantic hybridization and genetic algorithms to discover jailbreak prompts. The paper is available at https://arxiv.org/pdf/2605.21362.
  • TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization: A regularization framework controlling representational inefficiency in prompt optimization through Dual-Evidence Gradient Purification and Semantic Edit Regularization. Code: https://github.com/luchengfu6/TextReg.
  • Tracing the ongoing emergence of human-like reasoning in Large Language Models: Tests 25 LLMs and 313 humans on conditional statement interpretation, revealing a ‘Decontextualization Bias’ in LLMs. The paper is available at https://arxiv.org/pdf/2605.21299.
  • TimeSRL: Generalizable Time-Series Behavioral Modeling via Semantic RL-Tuned LLMs: Introduces TimeSRL, a two-stage LLM framework addressing cross-dataset distribution shift in longitudinal behavioral time-series modeling via a semantic bottleneck. The paper is available at https://arxiv.org/pdf/2605.21295.
  • Transforming Privacy Artifacts into Accessible Reports for Non-Technical Stakeholders: A conceptual framework leveraging LLMs to transform technical privacy artifacts into accessible privacy reports for non-technical stakeholders in Industry 5.0. Code: https://github.com/Ethical-Human-Machine-Interaction/PrivacyArtifacts2Report.
  • Multimodal Emotion Recognition with Large Language Models: A comprehensive review of Large Audio Language Models (LALMs), categorizing approaches into Affective Data Augmentation, Representation, and Reasoning. The paper is available at https://arxiv.org/pdf/2605.21239.
  • Do LLMs Know What Luxembourgish Borrows?: Probing Lexical Neology in Low-Resource Multilingual Models: Introduces LexNeo-Bench (3,050 instances) to evaluate multilingual LLMs on lexical borrowings in Luxembourgish. Code: https://github.com/NinaKivanani/LexNeo-Bench.
  • Federated LoRA Fine-Tuning for LLMs via Collaborative Alignment: Proposes CLAIR, a contamination-aware framework for federated LoRA fine-tuning that enables collaborative model improvement while preserving data privacy. The paper is available at https://arxiv.org/pdf/2605.21217.
  • Domain-Adaptable Reinforcement Learning for Code Generation with Dense Rewards: A PPO-based fine-tuning framework using multi-component token-level rewards to improve code generation for both general programming and robotics. The paper is available at https://arxiv.org/pdf/2605.21180.
  • Metaphors in Literary Post-Editing: Opening Pandora’s Box?: Investigates how post-editors of literary texts react to metaphor translation by NMT and LLMs, finding challenges with figurative language translation. The paper is available at https://arxiv.org/pdf/2605.21178.
  • Learning First Integrals via Backward-Generated Data and Guided Reinforcement Learning: Introduces FISolver, an LLM-based solver for discovering first integrals in dynamical systems using a “Backward Generation” algorithm and RL. The paper is available at https://arxiv.org/pdf/2605.21160.
  • Automated ICD Classification of Psychiatric Diagnoses: Compares classical NLP with LLM embeddings for ICD classification of psychiatric diagnoses in Spanish, finding transformer-based embeddings superior. The paper is available at https://arxiv.org/pdf/2605.21154.
  • SMoA: Spectrum Modulation Adapter for Parameter-Efficient Fine-Tuning: Proposes SMoA, a Spectrum Modulation Adapter for PEFT that partitions pretrained weight matrices into multiple aligned spectral blocks. The paper is available at https://arxiv.org/pdf/2605.21147.
  • Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation: Identifies “advantage collapse” in GRPO and introduces ACR (Advantage Collapse Rate) and AVSPO (Adaptive Virtual Sample Policy Optimization). The paper is available at https://arxiv.org/pdf/2605.21125.
  • ACL-Verbatim: hallucination-free question answering for research: Presents ACL-Verbatim, an extractive QA system for research papers that eliminates hallucinations by returning verbatim text spans. Code: https://github.com/KRLabsOrg/acl-verbatim.
  • TextSculptor: Training and Benchmarking Scene Text Editing: Presents TextSculptor, a framework addressing data scarcity and evaluation in scene text editing with TextSculpt-Data and TextSculpt-Bench. Code: https://github.com/linyiheng123/TextSculptor.
  • LoCar: Localization-Aware Evaluation of In-Vehicle Assistants: Introduces LoCar, an evaluation framework for Korean in-vehicle AI assistants, defining 13 KPIs to assess linguistic realization and dialogue competence. The paper is available at https://arxiv.org/pdf/2605.21086.
  • GradeLegal: Automated Grading for German Legal Cases: Investigates LLMs’ ability to grade German legal exam solutions, finding GPT-5 achieves human-level agreement in public law. Code: https://github.com/abdullahalzubaer/icail2026.
  • Fine-grained Claim-level RAG Benchmark for Law: Introduces ClaimRAG-LAW, a comprehensive multilingual benchmark for evaluating RAG systems in the legal domain. Dataset: https://huggingface.co/datasets/SNTSVV/ClaimRAG-LAW.
  • Multimodal LLMs under Pairwise Modalities: Explores training MLLMs using only pairwise modality supervision instead of expensive fully jointly aligned multimodal data. The paper is available at https://arxiv.org/pdf/2605.21059.
  • Cross-lingual robustness of LLM-brain alignment and its computational roots: Examines brain-LLM alignment across three typologically distinct languages (Mandarin, English, French) using whole-brain fMRI encoding models. The paper is available at https://arxiv.org/pdf/2605.21049.
  • Conditioning Gaussian Processes on Almost Anything: Establishes explicit equivalence between Gaussian processes (GPs) and linear diffusion models, enabling conditioning on arbitrary non-linear, non-Gaussian information including natural language via LLMs. The paper is available at https://arxiv.org/pdf/2605.21041.
  • Beyond Text-to-SQL: An Agentic LLM System for Governed Enterprise Analytics APIs: Presents Analytic Agent, an LLM-based agentic system translating natural language analytics requests into secure interactions with enterprise analytics APIs. The paper is available at https://arxiv.org/pdf/2605.21027.
  • PaintCopilot: Modeling Painting as Autonomous Artistic Continuation: A co-creative neural painting assistant modeling painting as open-ended autoregressive artistic behavior conditioned on canvas states and brushstroke history. The paper is available at https://arxiv.org/pdf/2605.20941.
  • MemConflict: Evaluating Long-Term Memory Systems Under Memory Conflicts: Introduces MemConflict, a diagnostic framework for evaluating long-term memory systems in LLM-based conversational agents under memory conflicts. Code: https://github.com/TaoZhen1110/MemConflict.
  • Strategy-Induct: Task-Level Strategy Induction for Instruction Generation: Introduces STRATEGY-INDUCT, a framework deriving task-level instructions solely from example questions without labeled answers. The paper is available at https://arxiv.org/pdf/2605.20924.
  • Terminal-World: Scaling Terminal-Agent Environments via Agent Skills: Introduces Terminal-World, a fully automated pipeline using agent skills as the central synthesis primitive for training terminal agents. The paper is available at https://arxiv.org/pdf/2605.20876.
  • PlanningBench: Generating Scalable and Verifiable Planning Data: Introduces PlanningBench, a synthetic planning data generation framework for evaluating and training LLMs, creating scalable, diverse, and verifiable planning instances. The paper is available at https://arxiv.org/pdf/2605.20873.
  • Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards: Introduces N-Step Forward-Trace Policy Optimization (NFPO), bridging local PPO/GRPO surrogate objectives and exact policy gradient objectives using cumulative likelihood ratios. Code: https://github.com/oh-lab/NFPO.
  • PlexRL: Cluster-Level Orchestration of Serviceized LLM Execution for RLVR: A cluster-level runtime system for multiplexing unified LLM services across RLVR jobs, reducing GPU-hour cost by up to 37.58%. The paper is available at https://arxiv.org/pdf/2605.20863.
  • GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval: A systematic evaluation of GraphRAG for healthcare EHR schema retrieval using locally deployed open-source LLMs on consumer hardware. The paper is available at https://arxiv.org/pdf/2605.20815.
  • PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models: Introduces PulseCol, a periodically refreshed column-sparse attention method to accelerate diffusion LLMs. The paper is available at https://arxiv.org/pdf/2605.20813.
  • Refining and Reusing Annotation Guidelines for LLM Annotation: Proposes using systematic refinement and reuse of annotation guidelines as an alignment mechanism for LLM-based annotation. Code: https://github.com/KonWooKim/llm-guideline-moderation.
  • Assessing socio-economic climate impacts from text data: A perspective paper reviewing studies using NLP and LLMs to extract socio-economic impacts of climate hazards from text, providing guidelines for reliable data creation. The paper is available at https://arxiv.org/pdf/2605.20793.
  • VIHD: Visual Intervention-based Hallucination Detection for Medical Visual Question Answering: Proposes VIHD, a training-free hallucination detection method for medical MLLMs using targeted visual token masking to calibrate semantic entropy. The paper is available at https://arxiv.org/pdf/2605.20772.
  • The Illusion of Intervention: Your LLM-Simulated Experiment is an Observational Study: Reveals that LLM-simulated user experiments suffer from “user drift,” creating selection bias that confounds effect estimates. The paper is available at https://arxiv.org/pdf/2605.20767.
  • Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression: Introduces Distribution-Aware Reward (DAR), an on-policy RL objective for LLMs to produce better predictive distributions for regression tasks. Code: https://github.com/verl-project/verl.
  • Distributional Alignment as a Criterion for Designing Task Vectors: Proposes dNTP, a metric measuring discrepancy in next-token probabilities between task vector-based inference and standard in-context learning. Code: https://github.com/AnonymousPaper/LTV.
  • CALMem: Application-Layer Dual Memory for Conversational AI: An application-layer dual memory architecture for LLM-based conversational assistants with virtually unbounded effective context. The paper is available at https://arxiv.org/pdf/2605.20724.
  • DIVE: Embedding Compression via Self-Limiting Gradient Updates: Addresses overfitting in embedding compression adapters through sparse-dense gradient decomposition, achieving 32x compression with nDCG@10 gains. The paper is available at https://arxiv.org/pdf/2605.20689.
  • IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools: A tool-augmented agentic framework for open-vocabulary industrial anomaly detection combining domain-specific fine-tuning with autonomous tool orchestration. The paper is available at https://arxiv.org/pdf/2605.20682.
  • REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak: A two-stage framework that internalizes self-reflection directly into LLMs’ generation trajectory to defend against indirect jailbreak attacks. The paper is available at https://arxiv.org/pdf/2605.20654.
  • HRM-Text: Efficient Pretraining Beyond Scaling: Introduces a Hierarchical Recurrent Model (HRM) replacing Transformers with a dual-timescale architecture for efficient language model pretraining. The paper is available at https://arxiv.org/pdf/2605.20613.
  • Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models: A large-scale audit of medical GPTs (MedGPTs) on the OpenAI Store to identify clinical hallucinations, policy violations, and privacy risks. The paper is available at https://arxiv.org/pdf/2605.20591.
  • What Do Agents Communicate? Characterizing Information Exchange in Multi-Agent Systems: Systematically analyzes inter-agent communication in Multi-Agent Systems powered by LLMs, identifying critical information categories. The paper is available at https://arxiv.org/pdf/2605.20548.
  • Sample Complexity of Transfer Learning: An Optimal Transport Approach: A rigorous theoretical analysis of transfer learning using optimal transport theory, deriving sample complexity bounds. The paper is available at https://arxiv.org/pdf/2605.20545.
  • Codec-Robust Attacks on Audio LLMs: Introduces CodecAttack, an adversarial attack optimizing perturbations in a neural audio codec’s latent space to achieve codec-robust attacks on Audio LLMs. The paper is available at https://arxiv.org/pdf/2605.20519.
  • Framing an AI with Values Reduces AI Reliance in AI-supported Writing Tasks: Investigates whether showing users an AI’s framed values can reduce overreliance on AI suggestions during writing tasks. The paper is available at https://arxiv.org/pdf/2605.20512.
  • Creating Learning Scaffolds for Engineering Design Using Concept Catalyst: Introduces Concept Catalyst, an AI-based web tool assisting K-12 engineering teachers in creating scaffolding questions for design challenges. The paper is available at https://arxiv.org/pdf/2605.20511.
  • Code Generation by Differential Test Time Scaling: Introduces DIFFCODEGEN, a novel test-time scaling method for code generation using coverage-guided differential analysis to select high-quality solutions. Code: https://github.com/SecurityLab-UCD/DiffCodeGen.
  • OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind: Introduces OSCToM, a framework generating adversarial Theory of Mind scenarios with nested belief conflicts, combining DQN-based RL with a surrogate evaluation pipeline. Code: https://github.com/sharminsrishty/osct.
  • Miller-Index-Based Latent Crystallographic Fracture Plane Reasoning: Investigates whether MLLMs can leverage Miller indices as structured latent representations for reasoning about fracture geometry. The paper is available at https://arxiv.org/pdf/2605.20416.
  • Mechanics of Bias and Reasoning: Interpreting the Impact of Chain-of-Thought Prompting on Gender Bias: Investigates how Chain-of-Thought (CoT) prompting affects gender bias in LLMs, finding superficial mitigation and embedded bias in hidden representations. The paper is available at https://arxiv.org/pdf/2605.20410.
  • Spectral Souping: A Unified Framework for Online Preference Alignment: Introduces Spectral Souping, a novel framework for efficient online preference alignment of LLMs, discovering a universal spectral representation within language MDPs. The paper is available at https://arxiv.org/pdf/2605.20408.
  • Decomposing MXFP4 quantization error for LLM reinforcement learning: A theoretical decomposition of MXFP4 quantization error into scale bias, deadzone truncation, and grid noise, with targeted corrections. The paper is available at https://arxiv.org/pdf/2605.20402.
  • DEL: Digit Entropy Loss for Numerical Learning of Large Language Models: Proposes Digit Entropy Loss (DEL), a novel loss function for improving LLMs’ numerical prediction capabilities by reformulating numerical learning using digit conditional probability. The paper is available at https://arxiv.org/pdf/2605.20369.
  • Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs: Proposes Quant.npu, an integer-only fully static quantization framework for efficient low-bit weight-activation quantization for LLMs on mobile NPUs. The paper is available at https://arxiv.org/pdf/2605.20295.
  • Weasel: Out-of-Domain Generalization for Web Agents: A trajectory selection method for offline training of web agents that improves out-of-domain generalization by balancing goal-conditioned importance with pairwise diversity. The paper is available at https://arxiv.org/pdf/2605.20291.
  • Plug-and-Play Spiking Operators: Breaking the Nonlinearity Bottleneck in Spiking Transformers: Proposes a plug-and-play framework implementing spike-friendly approximations for Transformer nonlinearities (Softmax, SiLU, RMSNorm) using LIF neuron populations. The paper is available at https://arxiv.org/pdf/2605.20289.
  • Adaptive Probe-based Steering for Robust LLM Jailbreaking: Enhances probe-based contrastive steering for jailbreaking LLMs by introducing adaptive direction refinement using model extraction and statistics-based strength tuning. The paper is available at https://arxiv.org/pdf/2605.20286.
  • Modality-Decoupled Online Recursive Editing: Addresses online model editing for MLLMs by proposing M-ORE, a modality-decoupled online recursive editor that maintains separate locality statistics for text and visual modules. Code: https://github.com/lab-klc/M-ORE.
  • A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook: A comprehensive survey investigating Large Audio Language Models (LALMs), examining architectural evolution and trustworthiness challenges. The paper is available at https://arxiv.org/pdf/2605.20266.
  • It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs: Introduces SELFCI, a complementary self-distillation framework aligning LLMs with Contextual Integrity (CI) principles without sacrificing task-solving capability. The paper is available at https://arxiv.org/pdf/2605.20258.
  • Efficient Table QA via TableGrid Navigation and Progressive Inference Prompting: Introduces two training-free prompting frameworks—TableGrid Navigation (TGN) and Progressive Inference Prompting (PIP)—to enhance LLMs’ reasoning capabilities on Table Question-Answering tasks. The paper is available at https://arxiv.org/pdf/2605.20254.
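
To make the "advantage collapse" entry in the list above more concrete, here is a minimal sketch of how group-relative advantages can vanish in GRPO-style training. It assumes the common formulation in which each sampled response's reward is standardized within its group, and it assumes, purely for illustration, that a group counts as "collapsed" when all of its rewards are identical. The function names, the use of population standard deviation, and the collapse-rate definition are hypothetical and are not taken from the paper, whose ACR metric may be defined differently.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style).

    Assumption: population standard deviation plus a small eps;
    real implementations may differ in both choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def advantage_collapse_rate(groups):
    """Fraction of groups whose rewards are all identical.

    Hypothetical definition for illustration only: such groups
    yield all-zero advantages and thus no gradient signal.
    """
    if not groups:
        return 0.0
    collapsed = sum(1 for g in groups if len(set(g)) <= 1)
    return collapsed / len(groups)


# Toy batch of four prompts, each with four sampled responses
# scored by a binary verifier (1.0 = correct, 0.0 = incorrect).
groups = [
    [1.0, 1.0, 1.0, 1.0],  # every sample correct: advantages are all zero
    [0.0, 0.0, 0.0, 0.0],  # every sample wrong: advantages are all zero
    [1.0, 0.0, 0.0, 1.0],  # mixed outcomes: informative advantages
    [0.0, 1.0, 0.0, 0.0],  # mixed outcomes: informative advantages
]

print(advantage_collapse_rate(groups))  # 0.5
```

In this toy batch, half of the groups contribute no learning signal because every response received the same reward, which is the kind of wasted computation a mitigation method would aim to reduce.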

Impact & The Road Ahead

This wave of research has broad implications across the AI/ML landscape. The ability to solve open mathematical problems with AI, as demonstrated by Google DeepMind, opens new avenues for scientific discovery: AI that autonomously proves complex theorems or generates novel algorithms could accelerate breakthroughs in physics, chemistry, or medicine. Similarly, SciCore-Mol, by Chen et al. from Peking University, integrates pluggable molecular cognition modules into LLMs, bridging discrete language with topological molecular data. This is a crucial step toward AI systems that can reason about chemical structures and reactions, potentially revolutionizing drug discovery and materials science. Their open-source project can be explored at https://github.com/OpenBMB/SciCore-Mol.

However, this power comes with immense responsibility. The work on Political Consistency Training by Phan et al. from the Center for AI Safety and UC Berkeley, and the CR4T framework by An et al. from Virginia Tech for adolescent LLM safety, underscore the urgent need for ethical guardrails and bias mitigation, especially in high-stakes areas like education and public discourse. In Tracing the ongoing emergence of human-like reasoning in Large Language Models, Morosi et al. from Universitat Autònoma de Barcelona find that LLMs exhibit a “Decontextualization Bias,” struggling with pragmatic reasoning despite strong semantic understanding. This is a fundamental gap that future models must address to achieve truly human-like intelligence.

From a systems perspective, the innovations in efficiency and resource management are paramount. FAME’s on-premise inference and reduced annotation effort (Wang et al., University of Toronto) make advanced AI accessible for enterprise log anomaly detection. PlexRL’s cluster-level orchestration (Zhang et al., National University of Singapore) reduces GPU-hour costs for RLVR training by up to 37.58%, widening access to powerful reinforcement learning techniques. Similarly, NasZip (Zou et al., Shanghai Jiao Tong University) accelerates Approximate Nearest Neighbor Search, a critical component of many LLM applications, through hardware-software co-design. These efficiency gains are vital for deploying ever-larger and more complex models in real-world scenarios, particularly on resource-constrained devices, as shown by Quant.npu for mobile NPUs.

Robust, specialized benchmarks are equally crucial: GS-QA for geospatial reasoning (Saeedan et al., University of California, Riverside), AtelierEval for text-to-image prompting (Luo et al., NYU Abu Dhabi), VGenST-Bench for spatio-temporal video understanding (Park et al., Sungkyunkwan University), SpaceDG for spatial intelligence under degradation (Zhou et al., Shanghai Jiao Tong University), AgroTools for agricultural agents (Ye et al., Sun Yat-Sen University), and S3 for medical seizure semiology (Zhang et al., UCLA). These benchmarks move beyond general metrics to diagnose specific failure modes, paving the way for targeted improvements in areas like medical diagnostics, autonomous driving, and agricultural automation.

The road ahead involves not only building more capable LLMs but also ensuring they are trustworthy and adaptable. The discovery of vulnerabilities like “directional motion blindness,” “covert political bias,” “relational deficits,” and “user drift” in simulations indicates that a deeper understanding of LLM internal mechanisms is essential. The increasing focus on agentic systems (e.g., FuzzingBrain V2 for vulnerability discovery, Analytic Agent for enterprise APIs, IndusAgent for industrial anomaly detection, and Terminal-World for terminal agents) signifies a shift toward LLMs that can act autonomously in complex environments. That shift demands not just advanced reasoning but also robust safety, transparency, and agents that learn continuously from their own actions and mistakes, as explored by ACE for self-evolving code generation and EvoVid for temporal-centric self-evolution in video LLMs. Taken together, this body of work suggests that the future of LLMs lies in integrating nuanced understanding, ethical alignment, and practical efficiency across an ever-expanding array of applications.
