In-Context Learning: Navigating Complexity, Enhancing Safety, and Unleashing New Capabilities in LLMs
Latest 35 papers on in-context learning: Jun. 20, 2026
In-context learning (ICL) has revolutionized how we interact with and understand large language models (LLMs), allowing them to adapt to new tasks and generate impressive outputs with just a few examples. However, this powerful paradigm isn’t without its complexities, from ensuring reliable knowledge to mitigating safety risks and pushing the boundaries of what LLMs can achieve. Recent research dives deep into these challenges, unveiling fascinating insights and proposing innovative solutions that promise to make ICL more robust, adaptable, and trustworthy.
The Big Idea(s) & Core Innovations:
One central theme emerging from these papers is the push beyond simplistic ICL applications towards more sophisticated, adaptive, and often multi-faceted approaches. We see a significant focus on enhancing reliability and reducing uncertainty, particularly in high-stakes domains. For instance, “Quantifying Aleatoric Uncertainty of In-Context Learning for Robust Measure of LLM Prediction Confidence” by Jinseok Chung and colleagues from POSTECH introduces self-function vectors to decompose predictive uncertainty, offering a principled way to understand why an LLM is uncertain. Complementing this, “From Drift to Coherence: Stabilizing Beliefs in LLMs” by SongEun Kim and co-authors (Seoul National University and KAIST) reveals that LLMs exhibit belief drift but self-stabilize. The authors propose Prompted Predictive Resampling (PPR) and a Self-Consistency (SC) loss to accelerate this process, leading to better-calibrated uncertainty.
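To make the idea of decomposing predictive uncertainty concrete, here is a minimal sketch of the standard entropy-based decomposition. It is not the self-function vector method from Chung et al.; it is a generic textbook illustration. It assumes (hypothetically) that the same query has been run with several different demonstration sets, each yielding one probability distribution over the answer options. The function names are illustrative only.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def decompose_uncertainty(distributions):
    """Split total predictive uncertainty into two parts.

    distributions: list of probability vectors, one per prompt variant
    (assumption: each variant uses a different demonstration set).

    total     = entropy of the averaged prediction
    aleatoric = average entropy of the individual predictions
    epistemic = total - aleatoric (disagreement between variants)
    """
    n = len(distributions)
    k = len(distributions[0])
    mean = [sum(d[i] for d in distributions) / n for i in range(k)]
    total = entropy(mean)
    aleatoric = sum(entropy(d) for d in distributions) / n
    epistemic = total - aleatoric
    return total, aleatoric, epistemic


if __name__ == "__main__":
    # Two prompt variants that confidently disagree: mostly epistemic.
    print(decompose_uncertainty([[0.95, 0.05], [0.05, 0.95]]))
    # Two prompt variants that agree the answer is a coin flip: mostly aleatoric.
    print(decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]]))
```

In this framing, high aleatoric uncertainty suggests the task itself is ambiguous, while high epistemic uncertainty suggests the prediction depends heavily on which demonstrations were shown.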
The challenge of knowledge conflict and unreliability is directly tackled by Huang Peng et al. from the National University of Defense Technology, China, in “Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference”. Their MACR framework uses adaptive knowledge assessment and multi-agent reasoning to explicitly resolve conflicts when both internal LLM knowledge and external context might be faulty, moving beyond simple source selection. For domain-specific tasks, “BCL: Bayesian In-Context Learning Framework for Information Extraction” by Haoliang Liu et al. (HiThink Research and UCL) introduces a Bayesian particle filtering framework that systematically refines label representations for information extraction and reports significant F1 improvements from optimizing fine-grained semantic subcategories.
Safety and alignment are critical concerns. Sihui Dai and Mann Patel from CapitalOne, in their work “What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?”, shed light on how safety-aligned LLMs interpret mixed compliance demonstrations, showing that benign and harmful examples are not interchangeable and that preference optimization (DPO) plays a key role in decoupling general cooperativeness from harmful compliance. They also reveal a strong recency bias: harmful demonstrations placed at the end of the context are most effective for ‘jailbreaking’. In contrast, Benjamin Sturgeon et al. from MATS, in “When Roleplaying, Do Models Believe What They Say?”, examine internal beliefs and find that role-play primarily shifts what models say rather than what they internally represent as true, unlike the more profound shifts seen in emergent misalignment.
Several papers also push the boundaries of ICL in specialized domains and for complex tasks. For code generation, “Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning” by Shi Chen et al. (China University of Mining and Technology) finds that supervised fine-tuning (SFT) outperforms non-parametric ICL for Solidity code generation, with ICL saturating quickly as context is added. In visual domains, “From Frames to Temporal Graphs: In-Context Egocentric Action Recognition with Vision-Language Models” by Bessie Dominguez-Dager et al. (University of Alicante, ETH Zürich, Microsoft) demonstrates that converting egocentric video into Temporal Action Graphs (TAG) enables VLMs to excel as symbolic reasoners, outperforming direct frame-based inference. Also in vision, Sunil Khatri and colleagues from Karlsruhe Institute of Technology, in “Beyond Model Size: Probing the Gaps in Visual in-Context Learning by Training a Tiny Model”, show that a tiny 1M-parameter model can match or outperform much larger models, challenging the assumption that size alone guarantees true contextual adaptation.
Under the Hood: Models, Datasets, & Benchmarks:
The advancements in ICL are often coupled with innovations in the underlying models, datasets, and evaluation protocols:
- Uncertainty Quantification: The work by Chung et al. (“Quantifying Aleatoric Uncertainty of In-Context Learning for Robust Measure of LLM Prediction Confidence”) introduces a novel evaluation protocol for uncertainty decomposition, utilizing synthetic and real-world tasks (WordNetMCQ, AG News, Emotion, HellaSwag, GSM8K) and models like LLaMA2 (7B, 13B, 70B), Qwen2.5-7B, and Mistral-7B. Their code is available at https://github.com/LOG-postech/self-fv-icl.
- Bayesian ICL & Prior Adaptation: Qingyang Zhu et al. (“Multi-Task Bayesian In-Context Learning” from NYU) explore multi-task Bayesian ICL using the ERA5 climate data benchmark. Their code is at https://github.com/martianmartina/multi-task-bayesian-icl/.
- Code Generation & Benchmarking: Shi Chen et al. (“Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning” from China University of Mining and Technology) introduce SolidityBench (5,470 repository-level Solidity contracts) and SolidityScore (a domain-aware semantic evaluation metric), evaluating code LLMs using resources like OpenZeppelin, Synthetix, and Etherscan. Code is available at https://github.com/ChenS0827/SCG.
- Text-to-SQL for Astronomy: P.A. Estévez et al. (University of Chile) in “Querying an astronomical database using large language models: the ALeRCE text-to-SQL system” developed a system for the ALeRCE astronomical database and released a dataset of 110 NL/SQL pairs, evaluating 13 LLMs including Claude Opus 4.6 and Gemini models.
- Mobility Trajectory Generation: Siyu Li et al. (Emory University) introduce TrajGenAgent (“TrajGenAgent: A Hierarchical LLM Agent for Human Mobility Trajectory Generation”), a zero-shot hierarchical LLM-agent framework, and a behavior-aware evaluation framework with anomaly detectors (ICAD, BeSTAD) on NumoSim and MobilitySyn datasets. Code is at https://github.com/Emory-AIMS/TrajGenAgent.
- Grammatical Error Correction (GEC): Guangyue Peng et al. (Peking University) in “Encode Errors: Representational Retrieval of In-Context Demonstrations for Multilingual Grammatical Error Correction” introduce Grammatical Error Representations (GER) and an effective retriever, evaluated on datasets like W&I+LOCNESS and CoNLL-14. Code: https://github.com/viniferagy/GER.
- Visual ICL Benchmarking: Pradnya Halady et al. (“Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks” from Karlsruhe Institute of Technology) introduced VIBE (Visual In-Context BEnchmark), a comprehensive evaluation toolkit across 14 datasets, 12 tasks, and 106 combinations. Sunil Khatri et al. (also from KIT) further developed TinyVICL (a 1M parameter model) in “Beyond Model Size: Probing the Gaps in Visual in-Context Learning by Training a Tiny Model”, evaluating it against models up to 7,000x larger.
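Several entries in the list above, most directly the GEC retriever, rest on choosing which demonstrations to place in the context. The sketch below shows the generic pattern of similarity-based demonstration retrieval. It is not the GER method from Peng et al.: the vectors here are hypothetical toy embeddings, and `retrieve_demonstrations` is an illustrative name, not an API from any of the linked repositories.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_demonstrations(query_vec, pool, k=2):
    """Return the k demonstrations whose vectors are closest to the query.

    pool: list of (vector, demonstration_text) pairs. In a real system
    the vectors would come from an encoder (assumption: not shown here).
    """
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


if __name__ == "__main__":
    # Toy pool: 2-d vectors standing in for learned representations.
    pool = [
        ([1.0, 0.0], "demo about subject-verb agreement"),
        ([0.0, 1.0], "demo about article usage"),
        ([0.9, 0.1], "demo about verb tense"),
    ]
    for demo in retrieve_demonstrations([1.0, 0.0], pool, k=2):
        print(demo)
```

The selected demonstrations would then be concatenated ahead of the query in the prompt; how they are ordered is a separate design choice.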
Impact & The Road Ahead:
These advancements herald a new era for ICL, moving beyond simple prompting to sophisticated, agentic, and self-improving systems. The ability to quantify and manage uncertainty, resolve knowledge conflicts, and systematically optimize ICL contexts will be critical for deploying LLMs in high-stakes environments like finance, healthcare, and critical infrastructure. One example is building management systems, addressed in “Brick-DICL: Dynamic In-Context Learning for Automated Brick Schema Classification” from the Amazon AWS Generative AI Innovation Center.
The insights into positional bias, cross-lingual dynamics, and the multi-mechanism nature of transformer predictions suggest that effective ICL requires a deep understanding of how LLMs process context, not just what context is provided. Self-improving frameworks like ELM (“Continual Self-Improvement with Lightweight Experiential Latent Memories” by Vaggelis Dorovatas et al. from Toyota Motor Europe) and agentic systems for complex tasks like code migration (“Agentic Framework for Deep Learning workload migration via In-Context Learning” by Qiyue Liang et al. from Google) and lookahead planning (“Fact-Augmented Lookahead Planning for LLM Agents” by Samuel Holt et al. from the University of Cambridge) point towards a future where LLMs can learn and adapt continuously and autonomously.
From bridging correctness and efficiency gaps in code translation with SWIFTTRANS (“Bridging Functional Correctness and Runtime Efficiency Gaps in LLM-Based Code Translation” by Longhui Zhang et al. from Harbin Institute of Technology) to enabling expressive pose control in image generation with Pose-ICL (“Pose-ICL: 3D-Aware In-Context Learning for Pose-Controllable Subject Customization” by Xuan Han et al. from Tongji University), ICL is becoming an indispensable tool for pushing AI’s capabilities. The road ahead involves further demystifying LLM internals, developing more robust evaluation metrics, and building truly adaptive systems that can seamlessly generalize across diverse, complex, and dynamic real-world scenarios. The pace of innovation in ICL suggests a future where AI systems are not just powerful, but also more reliable, interpretable, and aligned with human intent.